
Architecture for Biomolecular Simulation: Project Overview

    Document Revision History

Rev. 1.0 (01 June 2004), Jason Crawford: Initial version.
Rev. 1.1 (12 September 2005), Doug Ierardi: First-pass revision, incorporating architectural changes.
Rev. 1.2 (06 October 2005), Doug Ierardi: Updates based on comments from Ron Dror, David Kuo and Anne Weber, and additions on DESMOND from Edmond Chow.

This document gives an overview of a project at DESRAD to design and build a new type of computing architecture for biomolecular simulation. It gives background for the project, providing motivation for the problem domain and describing basic experimental and computational methods for these problems, including the concept of a molecular force field. It discusses computational challenges of the problem domain and techniques that researchers have developed to address them. It then describes the major elements of DESRAD's current approach to these challenges. Finally, it mentions open issues and briefly addresses business- and project-level concerns.

Architecture for Biomolecular Simulation: Project Overview
1 Introduction
2 Background
  2.1 Experimental and computational methods in drug development and protein research
    2.1.1 Experimental methods
    2.1.2 Computational methods
  2.2 Molecular force fields
  2.3 Computational challenges and techniques of force-field evaluation and molecular dynamics
    2.3.1 Approximation methods for long-range interactions
    2.3.2 RESPA and multiple-timestep methods
    2.3.3 Constraints
    2.3.4 Parallelization
    2.3.5 Specialized hardware
3 DESRAD's approach
    3.1.1 DESMOND
    3.1.2 Target molecular systems
  3.2 Architectural overview
  3.3 Details of the design
    3.3.1 The Midrange subsystem
    3.3.2 The Distant computation: grid-particle interactions and the FFT subsystem
    3.3.3 The Flexible subsystem
    3.3.4 The Memory subsystem
    3.3.5 The Timestep
    3.3.6 Hardware implementation of the processing node
    3.3.7 The Communication subsystem and interconnect network
  3.4 Other major systems issues


4 Open science and simulation issues
5 Business- and project-level concerns


1 Introduction

D. E. Shaw Research and Development (DESRAD) is investigating novel computing architectures for certain problems in computational biochemistry. Our specific target problem domain is that of structure-based problems, such as protein structure prediction, protein-ligand interaction, searching through a virtual combinatorial library of potential drug compounds, and optimizing existing drugs.

Over the next several years, we plan to design and build a high-performance computing architecture to accelerate molecular force field evaluation and molecular dynamics simulations. We expect this supercomputer to have a massively parallel architecture, including programmable processors and hardwired data paths on specially designed application-specific integrated circuits (ASICs), with processing nodes connected by a high-speed three-dimensional toroidal mesh network.

If successful, this architecture will enable certain structure-based problems in biochemical research and drug development to be solved with sufficient accuracy in a reasonable amount of time, thus bringing computational methods for these problems into the realm of usability.

2 Background

Because our most immediate applications involve proteins, we begin with a brief introduction to protein science.

Proteins play a wide variety of important roles in living organisms, functioning as structural components, as enzymes, and as parts of cell signaling pathways. When proteins operate improperly, they can cause diseases, as in the case of Bovine Spongiform Encephalopathy (Mad Cow Disease), for example. Because of their importance in our bodies, proteins are the targets of many drugs. By binding to a certain type of protein in the body, such drugs affect the protein's activity, usually by either activating or deactivating it, thereby intensifying or inhibiting its function. For these reasons, the interactions of proteins with each other and with other molecules are of great interest to the pharmaceutical industry and to researchers in biochemistry.

Protein interactions depend in large part on the structure of the molecules involved. The binding between a protein and another molecule is based on the physical structure of the molecules (proteins sometimes have docking sites or pockets into which other molecules, called ligands, fit) and on the electrostatic and other interactions between the molecules. (Note that these interactions typically are not chemical bonds, but rather much weaker physical interactions.)

Key to the understanding of protein interactions is knowledge of protein structure and of the chemistry and physics that arise from that structure. This is nontrivial because protein structures are complex. The building blocks of a protein are amino acids. A protein typically contains hundreds of amino acids, chained together in a linear sequence. A protein, however, has structure beyond the sequence of amino acids that make it up; this sequence is merely its primary structure. One part of an amino acid chain (also called a polypeptide chain) may coil up into a spiral (called an alpha helix); another part may fold back and forth on itself to form a long, flat sheet (called a beta sheet). These features are termed secondary structure. These components then fold up among each other into structured globular regions called domains; this is the tertiary structure of a protein. Finally, in some larger proteins, multiple polypeptide chains will interlock, possibly including non-polypeptide structures as well (such as the heme group in hemoglobin). This is quaternary structure. The result is a complex entity, typically globular in shape.


synthesize it. Determining an appropriate procedure can itself require a trial-and-error process, which is often performed by trained technicians and supervised by PhDs.

    2.1.2 Computational methods

Computational methods for developing drugs and studying proteins are of interest primarily because they have the potential to overcome the limitations of experimental methods. Computer methods may be able to determine protein structures that have so far eluded experimental methods; or quickly screen enormous virtual libraries of compounds for activity; or aid in the optimization of drugs and other molecules. This section describes several problems within the realm of structure-based methods, which run computational experiments[2] based on a structural model of the molecules involved. (Note that not all of the problems described below are within the target application domain of our proposed architecture.)

Unlike experimental methods, structure-based methods require that the structure of a molecule be determined before it can be experimented with. In particular, the structure of a target protein must be determined before any ligand can be tested against it. Hence protein structure prediction becomes a fundamental task in structure-based methods (whereas it is a secondary task in experimental methods). The primary structure of a protein, its amino acid sequence, is typically known (for instance, from the DNA sequence that codes for the protein). The computational task of protein structure prediction is to use the protein's primary structure to determine its higher-level structure.

Some approaches to protein structure prediction, which we may call database-driven approaches, rely on finding one or more proteins with similar structure in a database of known protein structures. One such method, known as homology modeling, is used when an experimentally determined structure of one or more evolutionarily related proteins (homologues) is available. These experimental structures are used as templates to build a theoretical homology model of the target protein. Other methods, such as fold recognition, can also make use of analogues, which are similar to the target protein in some way but are not related through evolution. Database-driven approaches can give a good approximation of the native state of a protein, but alone they are generally inadequate for producing structures that are sufficiently accurate to be used in other computational problems, such as structure-based drug design. Thus, a homology model or fold recognition model is often refined through other methods to get a more accurate structure (a non-trivial problem in itself). These techniques have other limitations: to use them, one must first have determined structures for similar proteins. Database-driven methods thus cannot be used to determine the first protein structures in a given class of proteins. (For example, membrane proteins: owing to the difficulties mentioned above, fewer than one hundred membrane proteins[3] have known structures, and only a few of these are human proteins.)

An alternative to database-driven approaches is to take a primarily physics-based approach, predicting structure based on a mathematical model of the physics of a molecular system.[4] Proteins may be modeled at the atomic level, with individual atoms or groups of atoms represented as point bodies in an N-body system. The most straightforward of these methods is molecular dynamics (MD), in which the force on each particle is calculated and Newton's laws are numerically integrated to predict the physical trajectory of each atom over time. An alternative is the class of Monte Carlo methods, which stochastically sample the potential energy surface of a system. Physics-based methods can either be used to refine homology models or be applied on their own, to determine protein structure ab initio (from scratch). MD can also be used to study the process of protein folding.

[2] In the jargon of biologists, experiments often are classified as in vivo (in a living organism) or in vitro (in glass, i.e., in a test tube or Petri dish, for example). The term in silico (in silicon) has recently become more common, to refer to experiments that are done in or by means of computer simulation.
[3] To get an idea of the status of membrane structures in the Protein Data Bank, see The Stephen White Laboratory at UC Irvine (http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html) or the Structural Classification of Proteins (http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.g.html). There are several, but for many of these the resolution is poor, possibly too poor for structure-based drug design.
[4] There is a sense, however, in which even physics-based methods are fundamentally data-driven. The forces or energies in physics-based methods are generally calculated according to some type of parameterized model. Although some if not all of the parameters for these models can theoretically be derived from principles of quantum mechanics, they usually come from experimentation and statistical fitting to data.
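As a rough illustration of the MD procedure described above, the sketch below advances one timestep with the velocity Verlet integrator; the force routine compute_forces stands in for an unspecified force-field evaluation and is purely a placeholder.

```python
import numpy as np

def velocity_verlet_step(pos, vel, forces, masses, dt, compute_forces):
    """Advance one MD timestep with the velocity Verlet integrator.

    pos, vel, forces are (N, 3) arrays, masses is an (N,) array, dt is the
    timestep, and compute_forces(pos) -> (N, 3) is the force-field evaluation.
    """
    acc = forces / masses[:, None]                 # a = F / m (Newton's second law)
    pos_new = pos + vel * dt + 0.5 * acc * dt**2   # update positions
    forces_new = compute_forces(pos_new)           # re-evaluate the force field
    acc_new = forces_new / masses[:, None]
    vel_new = vel + 0.5 * (acc + acc_new) * dt     # update velocities
    return pos_new, vel_new, forces_new
```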

One complicating factor in applying physics-based methods to protein structure prediction and other structure-based problems in computational biochemistry is that often these problems are not concerned with any single molecule or molecular event, but rather with the statistical properties of a very large collection of molecules. For instance, when predicting protein structure, we are interested not in the structure that one particular protein molecule may happen to fold into, but the most probable structure for molecules of that protein, the structure that will be the norm among a large collection of molecules of that protein. When studying binding, the question is not whether a particular protein and ligand will bind in a particular instance, but the concentration of bound ligands in the steady state, when a large number of proteins and ligands are put together.

The essential quantity to compute for these problems is the free energy of a molecular system. Free energy is not a property of a single state of a system, but of an ensemble of states. For the purposes of this document, it suffices to say that the lower the free energy of an ensemble of states, the higher the probability that a molecular system will be found in one of those states at any given time. The free energy of an ensemble is computed by summing probabilities over all states in the ensemble. The probability of a state is an exponential function of its potential energy that depends on the temperature of the system (specifically, the calculation uses a Boltzmann distribution). In practical terms, this means that for protein structure prediction, it is insufficient to perform a single run of an MD simulation, stop at some point, and take the end conformation as the native state. At a minimum, many conformations near the end of such a simulation must be sampled. More likely, multiple separate simulations must be run and the results compared. There are open questions surrounding this issue, some of which will be mentioned in a later section.
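In conventional notation (a sketch using standard symbols rather than any formula given in this document), the Boltzmann probability of a state i with potential energy E_i at temperature T, and the free energy of an ensemble A of such states, can be written

\[
p_i \;=\; \frac{e^{-E_i / k_B T}}{\sum_j e^{-E_j / k_B T}},
\qquad
F_A \;=\; -\,k_B T \,\ln \sum_{i \in A} e^{-E_i / k_B T},
\]

so that ensembles with lower free energy carry a larger total Boltzmann weight and are therefore more likely to be occupied.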

Once the structure of a target protein has been obtained, through either computational or experimental methods, computational methods can be used to predict the protein's interactions with ligands, a task loosely referred to as the docking problem.

Screening ligands for activity could in theory be done with physics-based methods. Owing to the enormous computational burden of that approach, other methods have been developed. Generally, different configurations of a protein and ligand are generated and a score is calculated for each according to a heuristic scoring function. To reduce the number of configurations that must be tested, the protein is held rigid or allowed a small degree of flexibility; the ligand is tested in various poses and orientations at various points near the protein. To further reduce the configuration space, it is necessary (or at least very helpful) to know the active site on the protein. This heuristic screening can identify ligands that are likely to bind to a target protein, but it cannot quantify that binding or predict other properties of the interaction, such as the concentration of the ligand necessary to produce a certain effect. For this purpose, the binding free energy of the interaction (the free energy difference between the unbound and bound states of the system) must be calculated. Too small a binding energy can make the required dosage of a drug too high to be practical; too high an energy can cause toxicity or other side effects. Accurate calculation of binding free energy is a significantly more computationally intensive task than simply determining whether two molecules are likely to bind, and relies heavily on accurate structural models of these interactions.

To test a wide variety of ligands against a target protein, combinatorial techniques generate vast numbers of ligands based on simple rules. The result is called a virtual combinatorial library of compounds. Heuristic searching methods can then identify the top few hundred drug candidates out of billions of compounds.

We have chosen to focus our attention on physics-based methods for protein structure prediction and other problems in computational biochemistry, rather than data-driven or heuristic methods, owing to the limitations of the latter and to the fact that the greatest unsolved computational challenges involve physics-based methods. Furthermore, we give more attention to MD than to Monte Carlo methods since, to date, Monte Carlo methods have not been particularly successful for the types of models in which we are interested.[5] As a consequence, some proposals for our architecture include specialized hardware to accelerate MD, but none include features specifically targeted at Monte Carlo. The discussion in the rest of this document will often assume the context of MD. However, by accelerating force and energy calculations, we expect our proposed architecture to be applicable to both classes of methods.

[5] Specifically, explicit solvent models, explained below.

The foregoing discussion has highlighted selected computational methods for problems in biochemistry. However, these are not necessarily the only problems to which we will apply the computing platform we plan to build. Some methods that are currently too computationally demanding for practical use may become attractive on our architecture, such as applications of MD to protein-ligand binding, or the design and optimization of protein structures themselves.

For many of the problems just discussed, experimental methods are currently preferred over computational ones, in part because current computing architectures and algorithms cannot achieve sufficient accuracy to be useful in a reasonable amount of processing time. More powerful or efficient computing systems, possibly armed with better models and simulation methods, would make this goal possible. Indeed, sufficiently powerful systems, only a few orders of magnitude away from today's technology, may be significantly faster and cheaper than experimental methods. The difference in degree of efficiency could even become a difference in kind, making it possible to test a much wider variety of ligands, or to target proteins that are prohibitively expensive or time-consuming to target with experimental methods. This vision has been a major motivator of computational biochemistry for the past few decades.

    2.2 Molecular force fields

As noted above, we have chosen to focus on physics-based methods for structure-based problems. The inner loop of these methods tends to solve the following problem: given a molecular system, calculate either (a) the potential energy of the system as a whole or (b) the force on each particle due to its interactions with the rest of the system. (Note that the force in (b) is simply the (negative) three-dimensional gradient of (a), making the two variations of the problem similar computationally.) Calculating forces and energies has thus become the focus of our architecture. These quantities are generally computed according to a particular type of model called a molecular force field.

Force fields approximate at the classical level a potential energy field that is fundamentally governed by the laws of quantum physics. As described above in the context of physics-based models, force fields model molecular systems at the atomic level, with atoms or groups of atoms typically represented as point bodies. (Force fields that model each atom individually are called all-atom force fields, in contrast to reduced models, which model groups of atoms as single points.) Each point body has a set of associated parameters such as mass and charge (actually a partial charge, so that the complicated electron distribution caused by atomic bonding can be modeled with point charges). These parameters are determined at initialization according to the atom's type and, in standard force fields, remain constant throughout the simulation. Atom type is not as simple as an atomic number on the periodic table: an atom's parameters can depend upon what particles are near it and, in particular, what other atoms are bonded to it. For example, a hydrogen atom bonded to a nitrogen atom (an amine) may have a partial charge that is different from one bonded to an oxygen atom (an alcohol). There can easily be several types each of carbon, hydrogen, oxygen, and nitrogen. The set of atom types and the parameters used for them are one of the characteristics that define a particular force field.[6]

The interactions among atoms are broken into several components. The bonded terms model the interaction between atoms that are covalently bonded. One such term effectively models a bond between two atoms as a harmonic oscillator, reflecting the tendency for two atoms to settle at a certain distance from each other, known as the bond length. Another term reflects the tendency for two bonds to bend towards a certain angle. Yet another term takes into account the effect of torsion, the twisting of a bond owing to the relative angles it makes with two bonds on either side of it. Several other types of terms are possible, many of them cross-terms, which take into account the interaction between the basic types of terms. Each term contains one or more parameters, and each parameter must have a value for each combination (pair, triplet, quartet, etc.) of atom types, leading to a profusion of parameters from all of the terms and cross-terms.
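As a minimal sketch of the three basic bonded terms just described (harmonic bond stretch, harmonic angle bend, and a periodic torsion), with hypothetical parameter names rather than values from any particular force field:

```python
import math

def bond_energy(r, r0, k_bond):
    """Harmonic bond-stretch term: atoms settle at the bond length r0."""
    return 0.5 * k_bond * (r - r0) ** 2

def angle_energy(theta, theta0, k_angle):
    """Harmonic angle-bend term: two bonds tend toward the angle theta0 (radians)."""
    return 0.5 * k_angle * (theta - theta0) ** 2

def torsion_energy(phi, k_torsion, n, phi0):
    """Periodic torsion term for the twist phi about a bond, with periodicity n."""
    return k_torsion * (1.0 + math.cos(n * phi - phi0))
```

In a real force field, each of these parameters is tabulated per combination of atom types, which is where the profusion of parameters mentioned above comes from.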

[6] See Jensen, Introduction to Computational Chemistry, p. 6, for a sample list of atom types.


The non-bonded terms in a force field model all-to-all interactions among atoms. One of the non-bonded terms calculates the electrostatic interaction of charges attracting and repelling each other according to Coulomb's law. The other type of non-bonded term models the van der Waals[7] interaction, a shorter-range interaction that comprises an attractive component and a repulsive component. The repulsive component dominates at very short distances of a few angstroms, where an angstrom (abbreviated Å) is 10^-10 meters. In a rough sense, the repulsive force can be thought of as keeping atoms from overlapping or crashing into one another. There are various ways to model the van der Waals potential. The attractive component is usually modeled as dropping off as the inverse sixth power of the distance between two particles, or 1/r^6; the repulsive component can be modeled with a similar power function (usually 1/r^12, for computational convenience on general-purpose processors) or with an exponential. (Contrast this to the electrostatic potential, which drops off more slowly, as 1/r.)
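The two non-bonded terms can be sketched as follows, using the common 12-6 Lennard-Jones form for the van der Waals interaction; epsilon, sigma, and the Coulomb prefactor are illustrative placeholders, not parameters taken from this document.

```python
def van_der_waals(r, epsilon, sigma):
    """12-6 Lennard-Jones form: 1/r^12 repulsion plus 1/r^6 attraction."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def electrostatic(r, q_i, q_j, coulomb_constant=1.0):
    """Coulomb's law: the potential between partial charges falls off as 1/r."""
    return coulomb_constant * q_i * q_j / r
```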

Different force fields exist, with different sets of atom types, different parameters for each type, and different terms in the force and energy equations. Many force fields have been developed in academia, including AMBER (Assisted Model Building with Energy Refinement), CHARMM (Chemistry at HARvard Molecular Mechanics), and OPLS-AA (Optimized Potential for Liquid Simulations, All-Atom).

It is an open question, which will be discussed at length in a later section, whether this relatively simple model of biochemistry is sufficiently accurate to solve structure-based problems. For now we merely note that for very short-range interactions, there may be a need for specialized computations, perhaps including calculations at the quantum level, the exact nature of which we cannot foresee. This will have important implications for fundamental architectural choices in our proposed machine.

2.3 Computational challenges and techniques of force-field evaluation and molecular dynamics

To give an appreciation for why many structure-based problems cannot be handled well by existing computing platforms, we must detail some of the complexities of force field evaluation (i.e., evaluation of forces or energies in a molecular system according to a force field model) and molecular dynamics.

First, more than just the target molecule must be modeled: the molecular surroundings of the target molecule, the solvent, must be modeled as well. In protein folding, for example, the interaction of the protein with the solvent is crucial to the folding process. The solvent is typically water, perhaps containing ions such as sodium and chlorine. In the case of membrane protein simulations, the lipid bilayer of the cell membrane is the solvent for most of the protein. The solvent can be modeled as a collection of particles (called explicit water or explicit solvent), with one or more particles used to represent each water molecule, or as a continuum model (called implicit water or continuum solvent) in which the dielectric effects of the water are simulated without using a particle-based model. We have chosen to use an explicit solvent model in our architecture since, at least in principle, explicit water is more accurate and maps more naturally to specialized hardware.

When explicit solvent is used, periodic boundary conditions are often applied; that is, the simulation space is given a toroidal topology, such that an atom that moves through one face of the simulation box appears on the opposite face. The simulation effectively becomes, not of one molecule in a box of solvent, but of an infinite number of identical molecules in as many identical solvent boxes, tiled in three dimensions. (Periodic boundary conditions are generally used to avoid artifacts that otherwise occur in simulating a box of solvated protein in a vacuum: water molecules, being polar, tend to orient themselves along the edge of the simulation box, an effect that propagates throughout the box.) Under periodic boundary conditions, the box is generally made large enough that there is sufficient water between any two copies of the target molecule to mask electrostatic interactions; thus a protein, for instance, does not strongly interact with other copies of itself, and folds almost as if it were in an infinite sea of water. About 10 angstroms of water surrounding a target molecule are sufficient for this purpose.
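Under periodic boundary conditions, each pairwise distance is measured to the nearest periodic image of the other particle; a minimal sketch for a cubic box (the cubic shape and the function names are assumptions for illustration only):

```python
import numpy as np

def minimum_image(displacement, box_length):
    """Wrap a displacement vector to the nearest periodic image
    for a cubic box of side box_length (toroidal topology)."""
    return displacement - box_length * np.round(displacement / box_length)

def wrap_positions(positions, box_length):
    """Map coordinates of atoms that leave the box back in through the opposite face."""
    return positions % box_length
```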

[7] Named for the Dutch physicist Johannes Diderik van der Waals and properly pronounced approximately "vahn der vahls."


When folding a protein with explicit solvent, a modest system size is on the order of 32,000 particles, with about 4000 atoms in the protein and the rest in the solvent. Such a system has about half a billion particle pairs, and as many interactions to calculate for each all-to-all force. Furthermore, the timestep used in MD simulations is on the order of a femtosecond (10^-15 seconds). Because the duration of a protein folding event is typically on the order of milliseconds, an MD run has to calculate on the order of 10^12 timesteps. Put together, a single protein folding MD run comprises something close to 10^20 interactions. Furthermore, as noted above, a single MD run may be insufficient to predict protein structure or solve other problems of interest.

Due to the combination of these factors, structure-based problems present an enormous computational challenge. Consider the simulation just described above: 32,000 particles, calculating interactions between all pairs each timestep, for 10^12 timesteps. If we estimate that 25 mathematical operations are required to calculate a single pairwise interaction, then to execute the simulation in a month requires computing power of approximately 5 petaflops (5 x 10^15 flops, where flops means roughly mathematical operations per second[8]). The fastest general-purpose processors today can perform in the range of 10 to 100 gigaflops (1 gigaflops = 10^9 flops); the fastest general-purpose supercomputers combine many processors to perform in the range of tens of teraflops (1 teraflops = 10^12 flops).
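The arithmetic behind these figures can be checked directly; the sketch below simply reproduces the back-of-the-envelope estimate in the paragraph above.

```python
n_atoms = 32_000
pairs = n_atoms * (n_atoms - 1) // 2            # ~5.1e8: "about half a billion pairs"
ops_per_pair = 25                               # estimated operations per interaction
timesteps = 1e12                                # ~1 ms of simulated time at ~1 fs/step
total_ops = pairs * ops_per_pair * timesteps    # ~1.3e22 operations for the whole run
seconds_per_month = 30 * 24 * 3600              # ~2.6e6 seconds
print(f"{total_ops / seconds_per_month:.1e}")   # ~4.9e15 ops/s, i.e. about 5 petaflops
```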

The following sections describe three major (and orthogonal) approaches to addressing this challenge: approximation algorithms, parallelization, and specialized hardware.

    2.3.1 Approximation methods for long-range interactions

The first approach to managing the enormous computational load of MD is to apply approximation algorithms that reduce the total number of operations required to compute the forces or energies in a molecular system. We mention here only a few of the most common methods for this purpose.

To begin, note that the number of non-bonded (electrostatic and van der Waals) interactions scales quadratically with the number of atoms in a system, since such interactions exist between all pairs of particles. On the other hand, the number of bonded interactions scales only linearly with system size, since the number of atoms bonded to any given atom is bounded by a constant. (If there are N particles in a system, then in computer science terms, we say that calculating the non-bonded interactions is O(N^2) whereas calculating the bonded interactions is O(N).) Thus, approximation algorithms generally focus on the non-bonded interactions. Even using such algorithms, the non-bonded interactions typically consume over 90% of the compute time for typical system sizes in MD implementations on sequential general-purpose processors.

By far the simplest approximation is what may be termed the cutoff method: ignore all interactions between particles that are separated by more than a chosen cutoff distance (a typical cutoff would be 10 to 15 Å). Distant particles typically contribute little to the total force on a given particle, since all non-bonded interactions eventually weaken with distance. Electrostatic interactions in particular are typically subject to the electric screening effect that occurs when negatively charged particles cluster near positively charged ones. For these reasons, a summation of only the short-range interactions gives an approximation to the total force or energy.

For the van der Waals potential, which falls off at large distances as the inverse sixth power of the distance between two particles, the cutoff method is probably sufficient. However, the consensus among researchers is that a simple cutoff is not sufficiently accurate for calculating the electrostatic potential, which falls off only as the inverse of the distance. Consequently, more sophisticated methods have been developed to calculate these potentials.

The most commonly used class of methods are the Ewald[9] methods. The basic Ewald method was originally created before the invention of the computer. Its purpose was to calculate the potential energy of an infinite (in the limit) periodic array of charges (such as an ionic crystal) in a finite computation. It thus finds natural application to a molecule in solvent under periodic boundary conditions. The method computes this energy by first computing the electrostatic potential field throughout space, and then computing the contribution to the potential energy from each charge as the product of the charge and the value of the electrostatic potential field at that point.

[8] The term flops is actually a contraction of "floating-point operations per second." It was originally used strictly to measure floating-point operations; however, today it is often used in the more general sense indicated here. We use the more general sense, since we are likely to use fixed-point, not floating-point, operations for many of our calculations.
[9] Named after the scientist who created the fundamental approach. Often pronounced "EE-wald," but probably more properly pronounced "AY-vald."

The essence of the Ewald method is to split the electrostatic potential field into two main parts, one of which can be computed efficiently in real space (by calculating pairwise interactions) and the other of which can be handled efficiently in the Fourier domain. (Note, however, that not all variations on Ewald use the Fourier domain.) The electrostatic potential at a point charge i is the sum of contributions from all other distinct point charges j. In calculating the electrostatic potential, Ewald methods center two Gaussian charge distributions on each point charge. One Gaussian (the spread charge distribution) contains exactly the same charge as the point charge on which it is centered; the second (the screening charge distribution) is identical to the first but of opposite sign. The inclusion of these pairs of oppositely charged Gaussians is a mathematical device: it has no effect on the net electrostatic potential at any point, but it allows the potential at a given particle i to be rewritten as the sum of three contributions, as follows:

- The contribution to the electrostatic potential at i due to all point charges j ≠ i and their associated screening charge distributions (the real-space component).
- The contribution to the electrostatic potential at i due to all the spread charge distributions in the system, including the one centered at i itself (the reciprocal component).
- The contribution to the electrostatic potential at i due to the screening charge distribution at site i (the self-interaction).

The self-interaction remains constant throughout the simulation, and can be calculated at initialization. The real-space component may be calculated directly by summing pairwise interactions between bare point charges and screened point charges. A cutoff method can successfully be applied to this sum because at large distances, the electrostatic potential contributed by the point charge together with its screening charge distribution falls off with distance much faster than that of a point charge alone.
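Written out in the standard Gaussian-splitting form (a sketch in conventional notation; the width parameter β is not specified in this document), the Coulomb kernel used in the pairwise sums is split as

\[
\frac{1}{r} \;=\; \frac{\operatorname{erfc}(\beta r)}{r} \;+\; \frac{\operatorname{erf}(\beta r)}{r},
\]

where the first (screened) term decays rapidly and is summed directly in real space under a cutoff, and the second (smooth) term corresponds to the spread Gaussian charge distributions handled in the reciprocal component.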

The reciprocal component is calculated by convolving the sum of the spread charge distributions with the 1/r Coulomb potential. The potential energy contributed by point charge i is then proportional to the product of the electrostatic potential at point i and the charge at i. The convolution can be calculated in a number of ways, which define different sub-types of methods. It is perhaps most natural to perform the calculation in reciprocal space. This is the approach taken in the original Ewald method, in which a discrete Fourier transform of the spread charge distributions allows the convolution to be performed as a multiplication in the Fourier domain. (To calculate the forces, an inverse transform is also required.) At least two other classes of methods are more efficient. In FFT-based Ewald methods, the discrete Fourier transform (and its inverse) are replaced by a fast Fourier transform (and its inverse), which requires interpolation of the charge distribution to a three-dimensional mesh. Examples of this are Particle-Mesh Ewald (PME) and a method that we have created called Gaussian-Splitting Ewald (GSE). Real-space methods, such as Successive Over-Relaxation (SOR) and multigrid summation, calculate the electrostatic potential in the real domain, using iterative finite-difference techniques.

The asymptotic complexity of these different methods varies from O(N log N) for the FFT-based approaches to O(N) for the Fast Multipole Method of Greengard. On conventional architectures, Ewald methods are generally the most efficient for all but very large problem sizes, probably on the order of 10^5 to 10^6 particles. Most software implementations of MD use PME. These include GROMACS (the fastest software package for single-processor commodity computers) as well as AMBER and CHARMM.[10]

[10] The names AMBER and CHARMM have been used above to refer to force fields, but they also refer to the software packages developed by the same groups to implement MD codes using those force fields.


    2.3.2 RESPA and multiple-timestep methods

The integration performed during an MD simulation is limited by several numerical constraints, one of which is the period of the fastest vibration in the system. In most force fields, these vibrations occur in the harmonic oscillators that model covalent bonds. A hydrogen-oxygen bond, for example, has a vibrational period of about 10 femtoseconds (fs). An appropriate integrator (like the various variants of the Verlet algorithm) requires a timestep on the order of 1 fs to accurately trace the evolution of these particles' positions.

A very different approach to reducing the computational load of long-range interactions is to compute them less frequently than short-range or bonded interactions. This is the approach of multiple timestep (or MTS) methods. In MTS methods, not all timesteps are identical: in some steps, perhaps only short-range forces are calculated; but every n steps, all forces are calculated. In MTS, we break the forces into a hierarchy of several levels (say, bonded interactions, short-range van der Waals and electrostatic interactions, and long-range electrostatic interactions), which, in this case, is an enumeration of the components of a force field, starting with those that change the most quickly and continuing through those that change more and more slowly. More precisely, these various components of the force field are then integrated with timesteps of different lengths: short timesteps for the quickly changing forces and longer timesteps for the slowly changing forces. In effect, with respect to the shortest of these timesteps (also called the inner timestep), we might then compute bonded interactions every timestep, short-range non-bonded interactions every third timestep, and long-range interactions only every sixth timestep. MTS methods are provably sound and are generally successful at reducing the total amount of computation required to simulate a fixed interval of time.
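A minimal scheduling sketch of the example just given, with hypothetical force and integration callables supplied by the caller; a production RESPA-style integrator nests the levels and applies the slowly varying forces as scaled impulses rather than simply skipping work.

```python
def mts_run(state, n_steps, bonded_forces, short_range_forces, long_range_forces, advance):
    """Multiple-timestep (MTS) scheduling: bonded forces every inner timestep,
    short-range non-bonded forces every third step, long-range electrostatics
    every sixth step. All callables here are placeholders for this sketch."""
    for step in range(n_steps):
        forces = bonded_forces(state)
        if step % 3 == 0:
            forces = forces + short_range_forces(state)
        if step % 6 == 0:
            forces = forces + long_range_forces(state)
        state = advance(state, forces)
    return state
```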

    2.3.3 Constraints

Other techniques can be applied to remove some of the most quickly changing degrees of freedom from the system, while still giving meaningful results. One such approach involves fixing the lengths of certain bonds in the system (typically the highest-frequency bonds, namely those involving covalently bonded hydrogen atoms). During the integration, these bonded particles are not treated simply as independent point objects; rather, the bonds themselves are treated as rigid objects of fixed length. Various algorithms for handling such holonomic constraints are known, with names like SHAKE and SETTLE. Each requires additional computation during an integration step. The gain comes from the fact that, when the highest-frequency components of the system are eliminated, the use of a larger timestep becomes feasible.

On our machine we expect to use a version of MTS with an inner timestep of approximately 2 fs, with constraints applied to covalently bonded hydrogen atoms. Bonded and short-range non-bonded interactions will be calculated on every timestep, and long-range electrostatic interactions will be calculated every third timestep.

    2.3.4 Parallelization

A second major approach to accelerating force field evaluation and MD is parallelization. Force field evaluation has a high degree of natural parallelism. In force field models, the potential energy of a system or the total force on an atom is the sum of a large number of terms, each of which can be calculated independently of the rest. This parallelism extends across pairs of particles and across the different types of interactions described above: both the different types of interactions and the interactions between different pairs of particles can be computed in parallel. In the discussion that follows, we will assume for simplicity of exposition that we are computing the force on each particle rather than the total potential energy of the system. It should be clear how the methods apply to energy calculation as well.

Decomposing the evaluation of a force field into separate types of interactions, computed separately and summed at the end, is straightforward. Distributing the interactions among a set of processing nodes[11] in a parallel computing system is not. In this section, we describe some of the principal methods for parallelizing the pairwise non-bonded force calculations. Note, however, that when one of the long-range approximation algorithms discussed in the previous section is applied, the calculation of long-range interactions must be parallelized as well; algorithms for that problem are not discussed here.

[11] When discussing an abstract parallel system, we will assume that the basic elements of the system are nodes, which contain processors and memory and which communicate through a network.

To appreciate the differences among parallelization methods for pairwise interactions, consider what happens in each timestep when given a molecular system with particles in known positions and with known velocities. First, the force field is evaluated to find the force on each particle. From the force, the acceleration of each atom in that timestep is determined according to Newton's laws. Velocities are updated based on the calculated acceleration, and positions are updated based on these calculated velocities. Then the next timestep begins, with this new set of particle positions and velocities.[12]

In a parallel implementation of this basic timestep, it is easy to see that at least two communication phases may be necessary in each timestep. First, before the forces can be computed, each node must know the positions of all particles whose interactions it will have to calculate; thus it must receive all relevant particle positions that it does not yet have. And second, before the positions can be updated, each node must know the total force on each particle it will have to update; thus it must receive any relevant forces that it does not have. Note that a node may have the partial force on a particle from some of the other particles in the system, but still need to receive the forces from the rest of the particles before it can compute the new position.

Thus, in any decomposition scheme for pairwise interactions, there are two basic issues. One, which we will call the question of particle assignment, is this: given an atom, which node is responsible for updating its position? The second issue, which we will call pair assignment, is that of determining, for every pair of particles (i, j), which node is responsible for calculating the force on i from j.

If no long-range approximation method is being used, and interactions are calculated between all pairs of atoms, then particle assignment is simple and not very important. All that is necessary is to distribute the atoms evenly among the processing nodes. An atom can be arbitrarily assigned to a node at the start of a simulation and the assignment need not change over the course of the simulation. The only issue, then, is that of pair assignment.

Under this assumption, the most straightforward approach to pair assignment is that of an atom-based decomposition, in which each node has responsibility for the forces on the atoms assigned to it from all other atoms.[13] In this decomposition, each node must broadcast the positions of the atoms it owns to all other nodes on every timestep. Each node then computes the forces exerted by other atoms on each atom it owns, and sums these individual forces into a net force per atom. Such a broadcast operation is expensive: with N atoms and P processing nodes, each node must send N/P data and receive N - N/P data over the course of the broadcast. This requires either a communication mechanism that supports broadcast, or extra copies of messages on a point-to-point network.

A force-based decomposition uses the same particle assignment, but implements a more sophisticated (and less straightforward) pair assignment in order to minimize communication. To understand this decomposition, consider the force matrix F: in an N-particle system, F is an N × N matrix in which the entry at position (i, j) is the force exerted by particle i upon particle j. Force-based decompositions divide the force matrix into a grid of smaller tiles, each of which represents the cross-product of a small range of values of i and j; each tile is assigned to a processor. (For the purpose of this discussion, let us assume that all of the tiles are simple N/√P × N/√P squares.) Each processor then needs only a relatively small set of atom positions in order to compute the pairwise forces for which it is responsible. However, computing the net force on each atom requires summing across an entire column of the matrix, which in turn requires each node to send N/√P data (partial sums on the portions of columns within its tile) to the other √P - 1 processors in its column. (Note that this is smaller than the broadcast in an atom-based decomposition by a factor of about √P.) Once these column-based broadcasts are complete, and atom positions have been updated, a similar communication step then distributes updated atom positions to the appropriate set of processors.[14]

[12] This is a simplified overview of an MD timestep; it has additional parts, which will be discussed in a later section.
[13] Because of Newton's third law of motion, which says that the force on atom i from atom j is the opposite of the force on j from i, it turns out that each node actually only has to calculate half of the pairs involving its atoms. We ignore this optimization here.
[14] There are a number of subtleties to correctly and efficiently implementing a force-based decomposition: in particular, the columns of the matrix must be permuted in a particular way for efficient communication. Also, Newton's third law of motion can again be used to advantage. For details, see Steve Plimpton, "Fast parallel algorithms for short-range molecular dynamics," JCP 1995.
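A sketch of the pair assignment in such a force-based decomposition, mapping each pair (i, j) to the node that owns the corresponding tile of the force matrix; the row-major node numbering is an arbitrary choice, and the column permutation and Newton's-third-law optimizations mentioned in the footnotes are ignored.

```python
def tile_owner(i, j, n_atoms, grid_side):
    """Owner of pair (i, j) when the N x N force matrix is cut into
    grid_side x grid_side square tiles, one tile per processing node."""
    tile = n_atoms // grid_side            # atoms per tile edge; any remainder
    row = min(i // tile, grid_side - 1)    # atoms fall into the last row/column
    col = min(j // tile, grid_side - 1)
    return row * grid_side + col           # node index, row-major
```

Each node then needs the positions of only the atoms spanned by its tile's row and column ranges, and the partial column sums it produces are combined with those of the other nodes in its tile column to form net forces.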

If pairwise interactions are computed only between pairs that are separated by less than some interaction radius or cutoff distance, as is the case in most molecular dynamics software, the situation is very different. In this scenario, the particle assignment is of crucial importance to the communication load. In particular, if the simulation volume is divided into boxes by a uniform spatial partition, and each node is assigned the atoms residing in one of these boxes, then each node needs only to communicate with the nodes responsible for those boxes which are nearest neighbors to its home box (and even then, only to the extent that the interaction radius extends into those neighbor boxes). This approach has the potential to require far less communication and computation than methods which calculate interactions between every pair. Note, however, that since atoms move during the course of a simulation, the assignment of atoms to boxes (and hence also to processing nodes) is not static: atoms can migrate from one node to another between timesteps.

A spatial decomposition uses this distribution of atoms to nodes. During any given timestep, each node has responsibility for the total force on all atoms that it owns. In this way, a spatial decomposition has essentially the same pair assignment as an atom-based decomposition. Spatial decomposition appears to have asymptotic advantages over other decomposition methods, and is the basis of the approaches we are considering. (Remember, however, that the use of a cutoff generally requires an approximation algorithm to calculate long-range interactions, where such a cutoff alone would introduce too much error; long-range electrostatics will therefore require communication in addition to that just described.)
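A sketch of the uniform spatial partition that underlies a spatial decomposition, assuming a cubic simulation box divided into boxes_per_side boxes along each axis (the names are illustrative only):

```python
import itertools
import numpy as np

def home_box(position, box_edge, boxes_per_side):
    """Assign an atom to a box of the uniform spatial partition."""
    idx = np.floor(position / box_edge).astype(int) % boxes_per_side
    return tuple(idx)

def neighbor_boxes(idx, boxes_per_side):
    """The 27 boxes (home box included) that can hold atoms within one box
    edge of the home box, with periodic wrap-around at the faces."""
    return [tuple((idx[d] + off[d]) % boxes_per_side for d in range(3))
            for off in itertools.product((-1, 0, 1), repeat=3)]
```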

Whatever decomposition is used, parallel MD codes face a basic challenge common to all parallel applications: as more processors are applied to any fixed problem, the running time of the application does not speed up linearly in the number of processors. That is, a parallel application run on 128 nodes will probably not perform 128 times faster than the same application run on one node; in fact, it will probably perform significantly slower than that. The ratio of the increase in performance to the increase in the number of nodes is called the parallel efficiency. If an application runs 64 times faster on 128 nodes than on one node, then it achieves a parallel efficiency of 50% on 128 nodes. Parallel efficiencies in this range are not uncommon on large numbers of nodes.

One of the factors that contributes to this less-than-perfect efficiency is the inherently sequential aspect of many algorithms: computations that cannot be split among multiple nodes (or at least, for which the distribution of computational load cannot be extended beyond a certain level). If we can model an application as having an amount P of load that can be fully parallelized and an amount S of load that must remain sequential, then we can model the run time on N nodes as S + P/N (whereas for perfect parallel efficiency, the run time would need to be (S + P)/N). Such an application runs into Amdahl's law which, roughly speaking, warns that parallelization can never improve performance by a factor of more than (S + P)/S, no matter how many processing nodes are used, meaning that parallel efficiency approaches zero as the number of nodes increases. For instance, if the sequential portion S is 20% of a problem and the parallelizable portion is only 80%, then the maximum performance improvement possible from parallelization is a factor of five.
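The figures in this paragraph can be reproduced directly from the S + P/N model (a small sketch; the 20%/80% split is the example from the text):

```python
def run_time(S, P, n_nodes):
    """Run time under the model in the text: sequential load S plus parallelizable load P."""
    return S + P / n_nodes

S, P = 0.2, 0.8                          # 20% sequential, 80% parallelizable
print((S + P) / S)                       # 5.0: Amdahl's bound on the speedup factor

speedup_128 = run_time(S, P, 1) / run_time(S, P, 128)
print(speedup_128, speedup_128 / 128)    # ~4.85x speedup, ~3.8% parallel efficiency
```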

Another contributor to parallel inefficiency is communication costs: in virtually all parallel applications, processing nodes must communicate with each other frequently during the course of the computation. Sometimes this communication is inherently sequential, contributing to the effect of Amdahl's law; but even when the communication is parallelizable, the total amount of communication often increases as the number of nodes increases; or, looking at it another way, the ratio of communication to computation increases as the number of processors increases. For instance, visualize an MD simulation box divided according to a spatial decomposition. While the computation performed by each node is (roughly) proportional to the volume of the box assigned to it, the communication performed by that node will be (roughly) proportional to the surface area of that box. As more nodes are used, the size of each box decreases, and thus the ratio of communication to computation increases. Thus the total runtime of the simulation cannot decrease as fast as the number of nodes increases, leading to decreased parallel efficiency with more nodes.

Because of this phenomenon, measures of scaling in parallel applications must differentiate between scale-up, which is the performance improvement seen when the problem size increases proportionally to the number of nodes, and speedup, which is the improvement seen when the problem size is held constant while the number of nodes is increased. (Note that since measures of scale-up do not compare equal problem sizes, some appropriate measure of performance must be devised, e.g., particle pairs interacted per unit time in an MD code.) Speedup, equivalent to dividing a constant-size simulation box into more and more boxes that are made smaller and smaller, is subject to the communication/computation problem just described; scale-up, equivalent to adding more and more constant-size boxes to make a larger simulation box, is not.

Molecular dynamics applications, however, are capable in general of very high efficiencies, and parallelization has been applied successfully to MD on general-purpose parallel machines. Probably the most scalable software package for MD is called NAMD. NAMD has been successfully ported to a number of hardware systems. On supercomputers with fast interconnects, NAMD has been parallelized to well over a few thousand processors, but only for simulations of very large molecular systems. For a smaller system size, around 30,000 atoms, NAMD scales well up to about 150 nodes on a high-end cluster. (Note that it has been claimed that recent improvements to the AMBER suite of programs make AMBER competitive with or superior to NAMD.)

Supercomputers have also been applied to computational biochemistry. IBM's Blue Gene project, a collection of high-performance architectures that (at full size) will each contain tens of thousands of nodes, was inaugurated with protein folding as its flagship application. Blue Gene began as an architecture in search of applications: by reducing the memory per processor and integrating several simpler processors on a single chip, its designers conceived of a machine capable of reaching one petaflops of performance. Molecular dynamics codes turn out to have a high ratio of computation to memory, and thus are suited to the Blue Gene architecture. (Due to changes in business priorities, however, the primary goals of the Blue Gene project appear to have shifted away from molecular dynamics and more toward general-purpose supercomputing, although the development of Blue Matter, IBM's MD code, continues, and production runs of problems in computational biochemistry are currently underway.)

    2.3.5 Specialized hardware

The two techniques discussed above, approximation algorithms and parallelization, can be implemented on general-purpose computing platforms. To understand the motivation for designing and manufacturing a specialized platform for our target problem domain, let us first note some features of general-purpose platforms.

At the core of any such platform are general-purpose programmable processors, each implemented as an integrated circuit on a small chip of silicon. An example is the Pentium processor in a desktop computer. We will focus our discussion on features of these general-purpose processors and the relative advantages possible in more specialized processors.

A programmable processor is designed to execute an almost arbitrary sequence of instructions. In the von Neumann architecture15 used by all commodity processors today, both the program and data reside together in a memory bank, which may be off-chip. The program consists of a sequence of instructions, which are mathematical or logical operations to be performed on the data. A central processing unit (CPU) reads a stream of instructions and data from memory and writes results back to memory. The memory is called random-access memory (RAM) because the CPU can access any part of it at any time, in any order, as opposed to sequential memories, which must be read in order from start to finish.

A programmable processor devotes a significant amount of hardware to the overhead of managing instructions and data. Before an instruction can be executed, it must be fetched from memory, and its compact form must be decoded to produce a set of signals that direct the rest of the execution. Its operands may need to be loaded, either from memory or from temporary storage locations called registers (which are generally packaged together in a unit called a register file). After the operation has been performed, the results must be written back either to memory or to the register file.

15 After John von Neumann, the scientist credited with creating the architecture.

Several techniques are used in the architecture of modern processors to accelerate the execution of a program. To begin with, the circuits that perform the instruction processing are often pipelined. Pipelining is a technique in circuit design that allows one set of data to begin processing before the last set is finished, much like an assembly line in a factory. Several instructions can thus be in the pipeline at once. Despite its benefits, pipelining instructions rather than processing them one at a time also introduces complications. Extra wires (called forwarding or bypass signals) are usually added to allow the results of one instruction to be forwarded directly to an instruction close behind it in the pipeline. If this is not possible, the pipeline must be stalled while the latter instruction waits for data from the former to be stored to and fetched from the register file, thus wasting time and reducing utilization of the pipeline. (The ability to stall also requires more circuitry.) Furthermore, if the program hits a decision point (a branch) and determination of the next instruction to be executed depends upon the results of a calculation or a value in memory, the pipeline cannot generally begin the next instruction immediately. Often a large amount of circuitry is devoted to a branch predictor, which attempts to predict the next instruction (and can back up and throw out a computation if its prediction turns out to be wrong), or to a speculative engine that executes both branches and, once the correct choice is known, throws out the one that turns out to be wrong.
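
The assembly-line intuition can be captured in a toy timing model. The Python sketch below, whose stage and instruction counts are arbitrary assumptions and which ignores stalls and branches entirely, compares the cycles needed to process N instructions one at a time with the cycles needed when they overlap in an S-stage pipeline.

    # Toy model of pipelined versus unpipelined instruction throughput.
    # Assumes an ideal pipeline with no stalls, forwarding delays, or branches.

    def unpipelined_cycles(n_instructions, n_stages):
        # Each instruction occupies the whole datapath for all of its stages.
        return n_instructions * n_stages

    def pipelined_cycles(n_instructions, n_stages):
        # Fill the pipeline once, then complete one instruction per cycle.
        return n_stages + n_instructions - 1

    n, s = 1000, 5
    print("unpipelined:", unpipelined_cycles(n, s), "cycles")   # 5000
    print("pipelined:  ", pipelined_cycles(n, s), "cycles")     # 1004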

General-purpose processors also suffer from the von Neumann bottleneck. Today's processors have the capacity to execute an instruction every few hundred picoseconds, but accessing main memory, which is very often large enough that it must reside on separate chips, takes hundreds of nanoseconds. Thus a modern processor has the capacity to execute instructions far faster than it can read them and their operands out of memory. The limit on performance is not the speed at which the arithmetic and logic unit (ALU) can execute instructions, but the speed with which memory can feed the ALU. To alleviate this problem, modern computers use one or more caches. A cache is a small bank of high-speed memory that stores a subset of the data that resides in main memory. If needed data can be retrieved quickly from the cache (an event called a cache hit), the processor does not need to wait for it to be retrieved from main memory. If not (a cache miss), the processor must wait for main memory. Often the performance of a modern application depends primarily on its cache hit rate. A large portion of a modern general-purpose processor is devoted to an on-chip cache.
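
The sensitivity to hit rate can be seen from the standard average-memory-access-time relation, sketched below in Python. The 1 ns hit time and 200 ns miss penalty are illustrative round numbers, not measurements of any particular processor.

    # Average memory access time = hit_time + miss_rate * miss_penalty.
    # Illustrative latencies; the point is how quickly the average degrades
    # as the hit rate falls.

    def average_access_time_ns(hit_rate, hit_time_ns=1.0, miss_penalty_ns=200.0):
        return hit_time_ns + (1.0 - hit_rate) * miss_penalty_ns

    for hit_rate in (0.999, 0.99, 0.90):
        print(f"hit rate {hit_rate:.3f}: "
              f"{average_access_time_ns(hit_rate):6.1f} ns per access")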

A specialized chip, designed specifically to accelerate force field evaluation rather than to execute an arbitrary instruction stream, can have significant advantages over a general-purpose processor. To begin with, once the hardware is no longer constrained to be Turing complete (that is, sufficiently expressive for the calculation of arbitrary computable functions), an architecture specialized to force-field evaluation can make better use of many processing elements in parallel on a chip. (The natural parallelism of force field evaluation was discussed in the previous section.) Beyond the obvious advantages of parallelism, note also that parallel processors can be more efficient when co-existing on a single chip than when they are spread out among chips, because intra-chip communication is far more efficient than inter-chip communication, as noted above. In general, the more densely the processing elements (PEs) can be packed on a chip, the more efficient the system as a whole will be if multiple such chips are used in parallel.

Note further that the number of PEs that can be placed on a chip is constrained by limitations on the physical size (area) of the chip and the amount of power it can consume.16 The less area and power a PE consumes, the more of them can be packed on a chip. Thus a special-purpose architecture must look for savings in area and power of its PEs as well as improvements in performance. To this end, the architecture can specialize its pipeline(s) to the specific calculations it must perform. These pipelines can include specialized mathematical units that would not exist in a general-purpose processor: for example, a unit to evaluate an arbitrary function by a table lookup and/or polynomial approximation. Such specialized units can be faster and consume less power and area than an equivalent collection of more general units that would implement the same functionality in a general-purpose processor. At the same time, needless units can be eliminated (floating-point arithmetic units, if no floating-point computation is used, for example), saving more power and area. Even the more commonplace arithmetic and logic units in a specialized pipeline can be tuned for performance and for area- and power-efficiency. In a general-purpose processor, these units must operate on a standard-width collection of bits, typically 32 or 64 bits in today's processors. This standardization is not necessary in a specialized pipeline; the precision of each arithmetic unit can be tuned to the needs of the application. This sort of width-tuning is particularly helpful for multipliers, for example, which consume area and power proportional to the square of the width of the words on which they operate.

16 The limitations on area and power come from engineering tradeoffs that must be made between these and other factors, such as cost, reliability, schedule, and engineering complexity. The main concerns are these: (1) As a chip design gets larger, it becomes more likely that any fabricated instance of that chip will have defects, which decreases the yield of usable chips from a fabrication run and thus increases the manufacturing cost per usable chip. (2) A chip that consumes a lot of power will require a more complex cooling system, which will be more prone to run-time failure.
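
To give a feel for the width-tuning point, the sketch below applies the quadratic area model described above to a few hypothetical operand widths. Both the widths and the simple quadratic scaling model are illustrative assumptions, not figures from any actual design.

    # Relative area (and, roughly, power) of a multiplier as a function of
    # operand width, under the quadratic scaling model described in the text.
    # Widths are hypothetical examples.

    def relative_multiplier_area(width_bits, baseline_bits=32):
        return (width_bits / baseline_bits) ** 2

    for width in (32, 26, 20, 16):
        print(f"{width}-bit multiplier: ~{relative_multiplier_area(width):.2f}"
              f" x the area of a 32-bit multiplier")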

A highly specialized architecture can even go further and completely eliminate the instruction stream, hardwiring a fixed calculation rather than allowing programmability. A hardwired circuit does not have the overhead of fetching and decoding instructions or reading and writing registers and memory. A hardware unit that implements a mathematical operation can be connected directly to the mathematical units that come before and after it; this hardwired pipeline can do in one step what requires several steps in a general-purpose processor. In addition, eliminating the circuitry required to process instructions (not just that for fetching and decoding, but also the large amount of circuitry devoted to forwarding, stalling, branch prediction, and the speculative engine, as well as any on-chip caches dedicated to instructions) clears up a lot of space on the chip (which, again, can be used to increase the degree of parallelism).

Finally, the architecture can stream the particle data rather than reading it out of a random-access memory. Compared to a randomly chosen program, the data access pattern of computing pairwise interactions is relatively simple and regular. Through clever control of the flow of the data, it can all go by the PEs in a stream rather than having to be called up from RAM one unit at a time. This streaming, together with the omission of instructions and of the need to fetch data from memory, effectively eliminates the von Neumann (memory bandwidth) bottleneck, allowing the pipeline to spend less time idling and more time computing.17 In addition, elimination of data caches and of hardware for interfacing with and controlling memory clears up even more space on the chip to pack functional units. Of course, not all such hardware will be eliminated. There will be at least a memory controller on the chip, for handling very large systems, and there will be caches, buffers and small local memories. A degree of programmability must be present to facilitate variation of the computation within the narrow range of algorithms of interest, and to handle the "unknown unknowns" problem of unanticipated needs that arise later in the machine's lifetime.

To summarize, the effect of these architectural differences, specifically on the implementation of molecular dynamics algorithms, is that:

The PE itself can be made faster than a general-purpose instruction processing unit, computing more interactions in a given unit of time.

The von Neumann bottleneck can be eliminated, allowing data to flow to the PEs at a high rate and thus improving their utilization.

Many PEs can be packed on a chip to attain a high degree of parallelism, further increasing the performance relative to a general-purpose processor.

The density of this packing means that more communication can be done on-chip rather than between chips, lessening the communication burden of the system.

At this point, we must note a few major disadvantages of designing and implementing specialized hardware. To begin with, any project to do so will be very costly. Manufacturing costs are high, as is the cost of engineering time for architecture, design, implementation, verification, and debugging. To build an ASIC costs tens of millions of dollars. For a full system one must add the cost of the circuit boards that hold the chips, the interconnection network, cooling and many other components. Time is another concern: such a project takes years. It must be remembered that while specialized hardware is being designed and built, the performance of general-purpose processors is improving exponentially with time, doubling about every 24 months. A specialized system that will not exist for four years must be compared not against existing cutting-edge general-purpose processors, but against processors four times as powerful. In addition, there is the risk that while the construction of the machine is underway, new methods will be developed (new algorithms, perhaps, or force field improvements or paradigm adjustments) that the more narrowly targeted ASICs cannot implement but that programmable processors can. Careful tradeoffs must be made to allow some degree of flexibility without seriously degrading performance. Because of these downsides, specialized hardware is generally only justified for extremely computationally intensive problem domains whose algorithms are not likely to change significantly and that offer a high payoff when solved.

17 Note, for example, the success of graphics processing units (GPUs), a sort of stream processor for calculations common in computer graphics.

We believe that structure-based problems in computational biochemistry satisfy these criteria. We are not alone in this belief: several other groups have also set out to create specialized hardware for force field evaluation and MD. The most notable of these is the MD-GRAPE18 project. The latest existing architecture from this project is MD-GRAPE3, which places special-purpose, MD-specific functional units into a parallel coprocessor. As announced in August of 2004:

    It was fabricated by Hitachi Device Development Center HDL4N 0.13 µm technology. It has 20 pipelines for force calculations which operate at 300 MHz at the typical case. The chip performs 660 equivalent operations per cycle and has the peak performance of 198 Gflops. The power dissipation is 19 W at 350 MHz (fastest) or 16 W at 200 MHz (typical).

A special-purpose machine for molecular dynamics simulations, called the Protein Explorer19, is expected to be built from 6144 MD-GRAPE3 chips by 2006, with a nominal peak performance goal of one petaflops. The target applications of Protein Explorer are somewhat different from those of our machine, since the MD-GRAPE project has focused more on speeding the force calculation on a single chip than on improving inter-chip communication. Protein Explorer therefore targets applications that are likely to be compute-bound, such as the simulation of huge proteins or protein complexes over short periods of time. We expect that it would be very inefficient to use the entire Protein Explorer machine to simulate a single medium-size protein. (Note that IBM's Blue Gene project, although it was inaugurated with protein folding as its flagship application, is not a project to build application-specific hardware. The Blue Gene architecture is that of a general-purpose programmable supercomputer, although its design makes it more suitable for certain classes of applications, like MD, than others.)
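
As a quick sanity check on the figures quoted above, the per-chip and full-machine peak rates can be recomputed from the announced numbers. The short Python sketch below only restates that arithmetic; it adds no data beyond what is cited in the text.

    # Recomputing the quoted peak-performance figures for MD-GRAPE3 and
    # the planned Protein Explorer machine from the numbers given above.

    ops_per_cycle = 660              # equivalent operations per cycle per chip
    clock_hz = 300e6                 # typical-case clock frequency
    chip_peak = ops_per_cycle * clock_hz
    print(f"MD-GRAPE3 chip peak: {chip_peak / 1e9:.0f} Gflops")          # ~198

    n_chips = 6144                   # chips planned for Protein Explorer
    system_peak = n_chips * chip_peak
    print(f"Protein Explorer nominal peak: {system_peak / 1e15:.2f} petaflops")  # ~1.2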

For the purpose of comparing the several architectures described in the last few sections, we consider the following useful way of describing the speed of MD simulations: by measuring their slowdown, the ratio of real time spent running the simulation to the simulated time represented within the simulation itself, on a chemical system of moderate size (~26,000 atoms). Today's fastest single-processor code for molecular dynamics, GROMACS, takes a tenth of a second to a full second of real time for each femtosecond of simulated time, for a slowdown near 10^14 or 10^15. Today's fastest parallel general-purpose code, NAMD, takes a few milliseconds per simulated femtosecond, for a slowdown near 10^13. Early simulations of the (unbuilt) Blue Gene/C machine forecast 375 microseconds of real time to perform 10 femtoseconds of simulated time, for a slowdown of about 4×10^10. (Recent results20 on Blue Gene/L, a different and more generic class of machine, have achieved a simulation rate of 50 ns per day on a 16,384-node machine.) Protein Explorer includes a range of possible configurations, but is estimated to be slower than Blue Gene/C.

18 GRAPE, for GRAvity piPE, is a project to use specialized hardware for gravity computations in N-body simulations. MD-GRAPE is a spinoff project to adapt the GRAPE architecture for molecular dynamics, motivated by the fact that the electrostatic interaction is identical in form to the gravitational interaction.

19 M. Taiji, T. Narumi, Y. Ohno, N. Futatsugi, A. Suenaga, N. Takada, A. Konagaya, Protein Explorer: A Petaflops Special-Purpose Computer System for Molecular Dynamics Simulations, SuperComputing (SC03), November 15-21, 2003, Phoenix, AZ.

20 B.G. Fitch et al., Blue Matter: Strong Scaling of Molecular Dynamics on Blue Gene/L, IBM Research Report RC23688 (W0508-035), August 2005. The report notes scalability through 16,384 nodes with a measured time per time-step of just over 3 milliseconds for a 43,222-atom protein/lipid system, equivalent to a rate of 50 nanoseconds per day. Also noteworthy is their ability to scale the problem to fewer than three atoms per node.

3 DESRAD's approach

For the reasons given above, we have decided to design an architecture for accelerated force-field calculation and molecular dynamics. The core of the system will evaluate force fields (calculating either the total potential energy of the system or the force on each atom due to its interactions with all other particles, depending upon the mode of the system); the rest of the system (mostly in that portion which we call the flexible subsystem) will handle the computation of the rest of a timestep (including integration, temperature and pressure control, and a variety of other tasks).

Our design target is for a machine21 of 512 processors22 that takes approximately 6 µs of real time for each simulated femtosecond, for a slowdown by a factor of 6×10^9. This would allow a millisecond of simulated time to be simulated in just under 69.44 days. (Many proteins fold within 20 ms.)
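
The figures in the design target are related by straightforward unit conversions, reproduced in the Python sketch below; the inputs are exactly the numbers stated above.

    # Relating the design target's numbers: 6 microseconds of wall-clock time
    # per simulated femtosecond gives the quoted slowdown and the wall-clock
    # time needed for a millisecond-scale simulation.

    real_seconds_per_fs = 6e-6            # design target
    femtosecond = 1e-15                   # seconds

    slowdown = real_seconds_per_fs / femtosecond
    print(f"slowdown factor: {slowdown:.0e}")                      # 6e+09

    simulated_seconds = 1e-3              # one millisecond of simulated time
    wall_clock_days = simulated_seconds * slowdown / 86400.0
    print(f"wall-clock time for 1 ms: {wall_clock_days:.2f} days") # ~69.44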

Our architecture incorporates all three major techniques described above to accelerate force-field evaluation and MD: specialized hardware, deployed in a massively parallel system, implementing efficient algorithms. What follows is a description of our approach, including a very high-level architectural overview and a description of the most important details of the design. Selected open issues in the architecture and design are also mentioned.

The current timetable for the completion of this machine sees the first generation able to perform production MD runs in the 2008-09 timeframe. In the interim, we have also implemented high-performance software for massively parallel MD on a dedicated commodity cluster. These MD codes have served as a test-bed, both for algorithm development and for exploration of numerical issues. But, perhaps more important, they have served as a platform for the ongoing research of DESRAD's chemists and its visitors during the years while the machine is still under development. While these production-quality MD codes are not the focus of this document, we briefly look at their implementation in the following subsection.

    3.1.1 DESMOND

DESMOND is a molecular dynamics code. Like the DESRAD machine, it is divided into a number of subsystems: flexible, middle, and distant subsystems. Its middle subsystem implements the Neutral Territory Method to improve data locality. Its distant subsystem uses the GSE method. Multiple timestepping is implemented with RESPA.

The first version, DESMOND Phase 1, is a single-node implementation that simulates the behavior of the proposed DESRAD machine, and serves as a validation of the machine's functionality. DESMOND Phase 1 has been used to experiment with algorithms, numerical precision and stability issues, and parameter choices for the machine.

The current version, DESMOND Phase 2, is a distributed-memory parallel code that uses MPI for message passing23. Like Phase 1, it is multithreaded and uses streaming SIMD extensions (SSE). DESMOND Phase 2 is currently hosted on (and best optimized for) a commodity cluster which we have deployed and dedicated exclusively to MD runs; the cluster features more than 1024 dual-core AMD Opteron processors interconnected by an InfiniBand network in a fat-tree topology. At this time, DESMOND Phase 2 is being used by DESRAD chemists for production computational experiments. The goal for Phase 2 is to achieve 1 microsecond of simulation time per week of wall-clock time for typical molecular systems.

21 We refer to instances of this hardware platform as "the machine" or "the DESRAD machine." Neither of these terms should be used as a proper noun since, at present, the machine has no name.

22 Larger configurations, of up to 4096 nodes, are also expected.

23 We may in fact develop our own message passing layer, if necessary, for better and more consistent performance.


Unlike the machine and Phase 1, DESMOND Phase 2 uses a different NT variant called the Midpoint Method24. Experiments have shown that the Midpoint Method has benefits on architectures like our cluster, where compute cycles are cheap relative to communication costs. On our cluster, the Midpoint Method is expected to incur a lower inter-processor communication cost on runs using up to 512 nodes.
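
The assignment rule of the Midpoint Method (described in footnote 24 below) can be sketched in a few lines of Python. The uniform grid of cubical home boxes, the 8-angstrom box edge, and the example coordinates are illustrative assumptions; the sketch also ignores periodic wrap-around and the other details a real implementation must handle.

    # Sketch of the Midpoint Method's assignment rule: a pair of particles is
    # interacted by the node whose home box contains the pair's midpoint.
    # Uniform cubical boxes assumed; periodic boundaries are ignored here.

    def owning_node(point, box_edge):
        """Grid index of the home box containing the given point."""
        return tuple(int(c // box_edge) for c in point)

    def node_for_pair(p1, p2, box_edge):
        midpoint = tuple((a + b) / 2.0 for a, b in zip(p1, p2))
        return owning_node(midpoint, box_edge)

    # Hypothetical coordinates (angstroms) with an 8-angstrom home-box edge:
    print(node_for_pair((3.0, 5.0, 2.0), (11.0, 6.0, 4.0), box_edge=8.0))  # (0, 0, 0)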

    3.1.2 Target molecular systems

It is helpful when considering our proposed architecture to have in mind the types of molecular systems we expect to model.

Several times we have talked about "typical" molecular systems. And indeed, both the hardware and software we are developing are optimized for molecular systems of particular sizes, corresponding to common classes of experiments involving solvated protein or protein-ligand systems, or a protein embedded in a solvated lipid bilayer. The standard system that we use for most of our analyses is a cubical simulation box, 64 angstroms on a side. This is large enough to hold a typical protein and enough surrounding water to serve as an electrostatic screen when the system is periodically tiled. The typical particle density for such a system is about one atom per 10 cubic angstroms of space, so a 64 × 64 × 64 angstrom cube would contain about 26,000 atoms.

This standard system, however, is somewhat smaller than average for the molecular systems we are interested in. In particular, systems that include a cell membrane are usually significantly bigger and are often non-cubical. (Such systems may be pictured as consisting of the lipid bilayer of a cell membrane extending off infinitely in two dimensions, surrounded by water on both sides. Proteins may be in or on the membrane. The entire system is tiled periodically in the third dimension as well.) A typical system of this sort would be 64 to 128 angstroms in the dimension normal to the membrane, but 256 angstroms by 256 angstroms in the other dimensions. Such a system would hold closer to 400,000 or 800,000 atoms.
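
Using the stated density of roughly one atom per 10 cubic angstroms, the atom counts for both system shapes follow directly, as the short Python sketch below shows; the specific membrane-box dimensions chosen are just the representative values quoted above.

    # Rough atom counts for the system shapes described above, assuming a
    # density of about one atom per 10 cubic angstroms.

    def estimated_atom_count(x, y, z, cubic_angstroms_per_atom=10.0):
        return x * y * z / cubic_angstroms_per_atom

    print(f"64 A cube:           ~{estimated_atom_count(64, 64, 64):,.0f} atoms")
    print(f"256 x 256 x 64 A:    ~{estimated_atom_count(256, 256, 64):,.0f} atoms")
    print(f"256 x 256 x 128 A:   ~{estimated_atom_count(256, 256, 128):,.0f} atoms")
    # -> roughly 26,000; 420,000; and 840,000 atoms respectively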

    3.2 Architectural overview

Our architecture is a massively parallel system with hundreds to thousands of identical processing nodes, all connected by a high-speed network. The core of each node will be an ASIC that implements all of these subsystems, together with a memory subsystem, the primary task of which is force accumulation, the summing-up of force terms produced by the various subsystems across the ASIC. The memory subsystem also serves as a DRAM controller, providing access to up to 2 GB of external DDR2 DRAM per node. With the addition of DRAM and an implementation of a DRAM-mode for the machine25, we expect the machine to scale to very large systems (one to two orders of magnitude larger than our typical targets) with a graceful degradation in performance.

We plan to use a spatial particle assignment in our pairwise interaction decomposition, because of the communication advantages of this method described in a previous section, particularly for the midrange subsystem, which implements a cutoff. Since processing nodes using this assignment scheme communicate with the nodes closest to them in space, we plan to connect the nodes of our system with a three-dimensional toroidal mesh, where each processor is directly connected to its six nearest neighbors (in 3-space). The periodic boundary conditions of our simulations are reflected directly in the network topology, with nodes along the leftmost face, for example, also connecting directly to their neighbors along the rightmost face. The technology selected for the inter-node network is NUMALink.
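
The wrap-around neighbor relation of such a toroidal mesh is easy to state precisely; the Python sketch below lists a node's six neighbors for an assumed 8 × 8 × 8 (512-node) mesh. The mesh dimensions and node coordinates are illustrative only.

    # Six nearest neighbors of a node in a 3D toroidal mesh, with wrap-around
    # at the faces mirroring the periodic boundary conditions of the simulation.

    def torus_neighbors(node, dims):
        neighbors = []
        for axis in range(3):
            for delta in (-1, +1):
                coords = list(node)
                coords[axis] = (coords[axis] + delta) % dims[axis]  # periodic wrap
                neighbors.append(tuple(coords))
        return neighbors

    print(torus_neighbors((0, 3, 7), (8, 8, 8)))
    # [(7, 3, 7), (1, 3, 7), (0, 2, 7), (0, 4, 7), (0, 3, 6), (0, 3, 0)]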

24 In the Midpoint Method, the interaction between each pair of particles is computed by the node that owns the midpoint between those particles; i.e., the midpoint falls within the home box of that node. Each node is responsible for the particles within its home box, but also contains copies of particles from adjacent nodes that are within approximately half the cutoff radius.

25 At present, although a DRAM-mode is being designed into the machine's first version (version 1.0), full support for this mode is not expected until version 1.5.


whether particles or grid points, locally within a specified cutoff radius. One of its chief tasks is the computation of midrange (less than 10-15 angstroms) non-bonded interactions through full pairwise summation. However, it is also used in the real-space components of the distant calculation, for charge spreading and force interpolation. In the present section, we shall focus on the implementation of the midrange calculation on this subsystem. We shall return to the distant calculation in the next section.

The midrange interactions comprise the attractive and repulsive van der Waals terms, together with those portions of the electrostatic computation from the Ewald summation that are naturally evaluated in real space. All of these computations can tolerate the imposition of a similar spatial cutoff.

At the system level, the computation uses a new decomposition method for pairwise interactions, the NT Method. The NT method combines advantages of spatial- and force-based decompositions: like a spatial decomposition, it assigns each node responsibility for the atoms in a region of space, allowing for nearest-neighbor communication. Like a force-based decomposition, it gives each node responsibility for calculating the interactions for a subset of particle pairs, carefully chosen to minimize communication requirements.

The NT Algorithm uses the spatial assignment of atoms to nodes induced by a regular partitioning of space. Once atoms are assigned to nodes, the next consideration is where the interaction for each nearby pair of atoms is computed. For a typical spatial decomposition, a natural approach is to assign these pairs to nodes so that each PE computes interactions between its own particles and any external particle within the cutoff radius. Because of the symmetries in the problem (actually, antisymmetries), just doing this will result in the redundant interaction of certain pairs of particles assigned to different nodes. To remove those symmetries, we might instead assign pairs so that each PE computes interactions between its own particles and external particles with a greater x-coordinate, for example.28 We call this approach the half-shell method, since it requires that each PE import the positions of all atoms that happen to lie within a hemispherical neighborhood of its assigned region of space. We can also view this in another way, which helps to explain the approach taken in the NT method: for any two particles separated by less than the cutoff, the node that will calculate their interaction is the node that owns the particle with the smaller x-coordinate. Recall that on each timestep, every PE must import data on the positions and properties of particles with which it must interact, and later export data on the forces or energies due to those interactions. In the half-shell method, each PE imports data from and exports data to a region of the simulation space corresponding to half of a shell, with thickness equal to the cutoff radius, surrounding that node's assigned box.
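
A minimal sketch of the half-shell ownership rule just described (the interaction of a pair within the cutoff is computed by the node owning the particle with the smaller x-coordinate) follows in Python. The 8-angstrom box grid and the coordinates are purely hypothetical, and, as the footnotes note, the real scheme handles ties, periodicity, and other subtleties that are ignored here.

    # Half-shell assignment rule: the pair is computed by the node that owns
    # the particle with the smaller x-coordinate. Simplified; ties, periodic
    # boundaries, and the cutoff test itself are omitted.

    def home_box(p, box_edge=8.0):
        return tuple(int(c // box_edge) for c in p)

    def half_shell_owner(p1, p2, box_edge=8.0):
        lower_x_particle = min((p1, p2), key=lambda p: p[0])
        return home_box(lower_x_particle, box_edge)

    print(half_shell_owner((3.0, 1.0, 2.0), (10.0, 1.0, 2.0)))   # -> (0, 0, 0)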

The NT Algorithm classifies these pairwise interactions differently: for any pair of particles separated by less than the cutoff, the interaction will be calculated by the PE that owns the box in the x-y plane of the particle with the greater x-coordinate and in the z-line of the particle with the smaller x-coordinate.29 Stated more simply: we look at the space of the simulation as a 3d vector space, which is the product of the 2d span of the x,y-units, on the one hand, and the 1d span of the z-unit on the other. A particular node containing a point (X,Y,Z) will compute interactions between any pair of points where one projects onto (X,Y) and the other projects onto (Z), and both are separated by no more than the cutoff. This is the neutral territory where the points will meet and where the interaction will be computed. To import all of the points for which it is responsible, each node n need only import particle data from a portion of the column in which it lies, called its tower (those other nodes that contain points within the cutoff radius of n and that project onto the x,y-coordinates of the box of space assigned to n), as well as from a certain slab of the simulation space, called its plate (those other nodes within the cutoff of n with the same z-coordinates as n's box). (As in the half-shell method, redundancies arising from symmetries are addressed by cutting the plate in half.)

The most significant advantage of this method over the half-shell method is a clear reduction in the amount of communication required. For example, if our simulation space is uniformly partitioned into boxes of width greater than the cutoff (in other words, partitioned so that every pair of particles that we need to compare either resides in the same box or lies in neighboring boxes), then the half-shell method would require each node to import particles from half of its 26 neighbors. On the other hand, the NT method requires that particles be imported only from 7 boxes (two in its tower and five in its plate). As the cutoff radius grows larger relative to the box size, the comparison of communication requirements increasingly favors the NT Algorithm.

28 This statement is actually an oversimplification, made for ease of exposition. For a complete discussion and analysis of this algorithm, please see David E. Shaw, A fast, scalable method for the parallel evaluation of distance-limited pairwise particle interactions, J Comput Chem 26: 1318-1328, 2005.

29 Again, this is an oversimplification for expository purposes, which does not describe the decomposition exactly.
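
For concreteness, the simplified NT ownership rule quoted above (x and y box coordinates from the particle with the greater x-coordinate, z box coordinate from the particle with the smaller x-coordinate) can be written down directly. The Python sketch below uses a hypothetical 8-angstrom box grid and example coordinates, and, like the rule itself, glosses over the refinements noted in footnote 29.

    # Simplified NT assignment rule: the interacting node takes its x,y box
    # coordinates from the greater-x particle and its z box coordinate from
    # the smaller-x particle, so the pair often meets in "neutral territory".

    def home_box(p, box_edge=8.0):
        return tuple(int(c // box_edge) for c in p)

    def nt_owner(p1, p2, box_edge=8.0):
        smaller_x, greater_x = sorted((p1, p2), key=lambda p: p[0])
        bx, by, _ = home_box(greater_x, box_edge)   # x,y from greater-x particle
        _, _, bz = home_box(smaller_x, box_edge)    # z from smaller-x particle
        return (bx, by, bz)

    p1, p2 = (3.0, 1.0, 10.0), (10.0, 1.0, 2.0)
    print(home_box(p1), home_box(p2), nt_owner(p1, p2))
    # (0, 0, 1) (1, 0, 0) (1, 0, 1): neither particle's own home box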

At the node level, the midrange subsystem exhibits additional parallelism. An instance of this subsystem consists of an array of units called particle-particle interaction modules, or PPIMs.30 Each PPIM will contain several matchmaking units; particles from a node's tower are distributed among the matchmaking units of each PPIM in this array. Particles from its plate stream past the matchmaking units, which then decide for which tower-particle/plate-particle pairs interactions should be computed, based primarily (but not exclusively) on the distance between the particles. Selected pairs are passed to a deeply pipelined processing element called the particle-particle interaction pipeline, or PPIP.31 The PPIP then produces either the force on each particle due to the other or the potential energy contribution from the pair. (There are also other modes in which the PPIP can calculate the contribution of the pair to pressure or potential.) At its heart, however, the current design for the PPIP can be configured to compute a rather general function of the particles, based on their identities and inter-particle distance, and calc