MSc in High Performance Computing – Computational Chemistry Module
Lecture 7: Parallel Approaches to Quantum Chemistry
(i) Replicated Data Parallelism
Huub van Dam and Paul Sherwood, STFC Daresbury Laboratory
Outline of the Lecture
The GAMESS-UK package
– Parallelisation of the SCF process
  – Expense of computational steps: integral generation and eigensolution
  • Static load balancing
  • Dynamic load balancing
– Parallelising Linear Algebra
– Dealing with I/O requirements, mapping files to disk
– Observed performance of the MPI version
– Two-level Parallelism
  • Task Farming in genetic algorithms – QDVE, generation of viable catalysts
  • Reaction Path methods – minimisation of the reaction path in the chorismate mutase enzyme
GAMESS-UK
● Generalised Atomic and Molecular Electronic Structure System
– Developed and maintained over 20 years as part of CSED's support for CCP1
– now over 1,200,000 lines of Fortran source
● Functionality
– HF, MCSCF, MP2, CI wavefunctions
– Density Functional Theory (including Hessians)
– excitation and ionisation energies (OVGF, 2ph-TDA, RPA...)
– a wide range of properties and analysis tools
– QM/MM implementations (CHARMM, ChemShell)
– Molec. Phys. 103 (2005) 719-747
● Active developments:
– excited state energies and forces from time-dependent DFT (jointly with PNNL)
– NMR chemical shifts
261 licensed groups; see http://www.cfs.dl.ac.uk
Scaling of Molecular Computations
The relative computing power required for molecular computations at four levels of theory. In the absence of screening techniques, the formal scaling for configuration interaction, Hartree-Fock, density functional and molecular dynamics methods is N^6, N^4, N^3 and N^2 respectively.
Parallelisation of GAMESS-UK
● 1985: HF-SCF module initially parallelised using message passing (TCGMSG, later PVM and MPI) for iPSC-class machines
● 1990s: use of global memory through the Global Array Tools (PNNL)
– data objects can be distributed and accessed without synchronisation, e.g. in-core storage of integrals
– enables a wider range of algorithms to be tackled, e.g. parallel MP2, mapping I/O to memory access
– the core of the program still uses a replicated data strategy, so node memory limits the maximum system size
(Diagram: the Global Arrays present a single, shared data structure over physically distributed data.)
● 2003: implementation of a partial distributed-data strategy
– new F90 matrix module, mapping most of the matrices to global memory; can be re-used for other codes, e.g. CRYSTAL
– standards-based (MPI, ScaLAPACK)
Self Consistent Field Method (SCF)
● SCF theory: each electron interacts with a mean potential created by the other (N-1) electrons.
● The SCF method is derived by assuming a specific form of the solution to the quantum mechanical equation – the Schrödinger equation (HF theory) or the Kohn-Sham equation (DFT) – leading to a set of coupled integro-differential equations that could, in principle, be solved numerically.
● It is more common to expand the solutions in a finite set of primitive functions (the basis set). The equations then become a set of coupled homogeneous equations, usually written in matrix form.
● The eigenvalues and eigenvectors of the matrix describing the particle interactions are required; because of the coupling in the matrix (the matrix is defined in terms of its own solutions) a self-consistent solution is sought.
● This implies an iterative process, iterating until the Fock matrix remains constant from iteration to iteration, as sketched below.
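A minimal pseudocode sketch of this cycle (the helper names guess_density, two_electron_terms, diagonalise and density_from_orbitals are illustrative, not GAMESS-UK routines):

  P = guess_density()
  do iter = 1, maxiter
     F = H0 + two_electron_terms(P)     ! integral generation and Fock build
     call diagonalise(F, C, eps)        ! eigensolution for MO coefficients
     Pnew = density_from_orbitals(C)    ! occupied MOs -> new density (dgemm)
     if (converged(P, Pnew)) exit
     P = Pnew
  enddo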
Schematic SCF
(Diagram: MOAO represents the molecular orbitals, P the density matrix and F the Fock or Hamiltonian matrix. Starting from guess orbitals, P is formed from MOAO via dgemm; the integrals contribute V1e, VCoul and VXC terms to F; a sequential eigensolver yields new MOAO; the loop repeats until converged.)
Computationally most expensive steps:
– Integral generation: O(N^4) to O(N^2.5)
– Exchange-correlation quadrature (DFT only): O(N)
– Fock matrix construction from integrals: O(N^4) to O(N^2.5)
– Diagonalisation: O(N^3); orthogonalisation: O(N^3)
– Matrix multiply: O(N^3)
F_μν = H⁰_μν + Σ_λσ P_λσ [ (μν|λσ) − ½ (μλ|νσ) ]
2-Electron Integral Generation Loop
Integrals are computed in a 4-nested loop over basis function shells
– basis functions within a shell share exponents

  do ish = 1, nsh
    do jsh = 1, ish
      do ksh = 1, ish
        do lsh = 1, ksh
          Compute batch of integrals
          Store (conventional SCF)
            or
          Multiply by P to get F (direct SCF)
        enddo
      enddo
    enddo
  enddo
F_μν = H⁰_μν + Σ_λσ P_λσ [ (μν|λσ) − ½ (μλ|νσ) ]
Parallelisation with static task allocation
Use the node number to assign tasks:

  do ish = 1, nshells
    do jsh = 1, ish
      do ksh = 1, ish
        do lsh = 1, ksh
          if (mod(lsh, nnodes) .eq. mynode) then
            Compute batch of integrals
            Multiply integrals by P to get F
          endif
        enddo
      enddo
    enddo
  enddo
  call global_sum(F)
Alternatively, test one loop further out so each task is a complete lsh loop:

  do ish = 1, nshells
    do jsh = 1, ish
      do ksh = 1, ish
        if (mod(ksh, nnodes) .eq. mynode) then
          do lsh = 1, ksh
            Compute batch of integrals
            Multiply integrals by P to get F
          enddo
        endif
      enddo
    enddo
  enddo
  call global_sum(F)
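The placement of the mod test sets the task granularity: testing on lsh parcels out individual integral batches (finest grain, best balance, most bookkeeping), while testing on ksh hands each node a complete inner lsh loop (coarser grain, less overhead, but larger variation in task size).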
Parallel SCF considerations
● In conventional SCF we are creating one integral file on each node. Some logic is needed to support this (GAMESS-UK uses special file names such as ed2000, ed2001, ... etc.)
● Not all tasks are the same size:
– if ish, jsh, ksh, lsh are indices of shells comprising a single function, there is only a single integral
– if ish, jsh, ksh, lsh are indices of shells containing Px, Py, Pz, there are potentially 3x3x3x3 integrals (some equivalent by symmetry)
– even more for d, f, g orbitals
● All nodes must wait for the global sum at the end:
– the "slowest" node controls the speed of execution
– dynamic allocation of tasks can reduce the load imbalance
Dynamically Load Balanced SCF

  itask = next_task()
  icount = 0
  do ish = 1, nsh
    do jsh = 1, ish
      do ksh = 1, ish
        do lsh = 1, ksh
          if (icount .eq. itask) then
            Compute batch of integrals
            Multiply integrals by P to get F
            itask = next_task()
          endif
          icount = icount + 1
        enddo
      enddo
    enddo
  enddo
  call global_sum(F)
Implementing dynamic load balancing
● Some toolkits provide a "global counter"
– e.g. GAMESS-UK was first parallelised using TCGMSG, which provides the NXTVAL() call
• the implementation is quite machine dependent
● When using MPI-1, an additional task can be assigned to hold the counter and reply to incoming requests (see the sketch below)
– quite wasteful for small node counts
– the basis of the GAMESS-UK dynamic load balanced MPI version
● Dynamic allocation works better if large tasks are encountered before small ones
– more efficient to reverse the loop orderings in integral generation
• generally better to use a g, f, d, p, s ordering of shells rather than s, p, d, f, g
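A minimal sketch of the MPI-1 counter-task idea (assumed code, not the GAMESS-UK source): one dedicated rank sits in a receive loop serving task numbers, and next_task() is a send/receive pair against that rank.

  ! Counter server: runs on one dedicated rank, handing out task numbers.
  subroutine counter_server(comm)
    use mpi
    integer, intent(in) :: comm
    integer :: counter, buf, status(MPI_STATUS_SIZE), ierr
    counter = 0
    do
      ! tag 1 = task request; tag 2 = shut down (sent once after the global sum)
      call MPI_Recv(buf, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, status, ierr)
      if (status(MPI_TAG) == 2) exit
      call MPI_Send(counter, 1, MPI_INTEGER, status(MPI_SOURCE), 1, comm, ierr)
      counter = counter + 1
    enddo
  end subroutine counter_server

  ! Workers fetch the next global task number from the server rank.
  integer function next_task(comm, server)
    use mpi
    integer, intent(in) :: comm, server
    integer :: dummy, status(MPI_STATUS_SIZE), ierr
    dummy = 0
    call MPI_Send(dummy, 1, MPI_INTEGER, server, 1, comm, ierr)
    call MPI_Recv(next_task, 1, MPI_INTEGER, server, 1, comm, status, ierr)
  end function next_task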
Further Parallelisation of the SCF Process
Symmetric eigensolver routines – parallel diagonalisation:
– PeIGS: G.I. Fann and R.J. Littlefield, "Parallel inverse iteration with reorthogonalization", in Proceedings of the Conference on Parallel Processing for Scientific Computing, SIAM, pp. 409-413.
– Solves dense real symmetric standard (Ax = λx) and generalised (Ax = λBx) eigenproblems.
– The numerical method used is multisection for the eigenvalues, and repeated inverse iteration with orthogonalisation for the eigenvectors.
– Guarantees orthogonality of the eigenvectors, even for arbitrarily large eigenvalue clusters that span processors.
– ScaLAPACK: PDSYEVX, PDSYEVD, PDSYEV – a variety of routines and algorithms.
Parallel Linear Algebra
● ScaLAPACK – drivers for solving standard and generalised dense symmetric or dense Hermitian eigenproblems (an illustrative PDSYEVD call is sketched below):
– PDSYEV (QR method) (ScaLAPACK 1.5)
– PDSYEVX (bisection and inverse iteration) (ScaLAPACK 1.5)
– PDSYEVD (divide-and-conquer method) (ScaLAPACK 1.7)
● BFG (I. Bush) – block Jacobi method on the full dense symmetric matrix (+ Hermitian)
● PLAPACK
– QR method
– MRRR 'Multiple Relatively Robust Representations'
Parallel Diagonalisation – Scalability of Algorithms
(Chart: real symmetric eigenvalue problem, Fock matrix, N = 1152; time in seconds (0-2) against number of processors (8, 16, 32, 64) for PeIGS 2.1 (PDSPEV), ScaLAPACK (PDSYEV), ScaLAPACK (PDSYEVD) and BFG.)
Parallel Eigensolvers
(Chart: PDSYEVD performance for a Fock matrix (CRYSTAL), N_basis = 3888; total time in seconds against number of processors (16-512) on IBM p690, p690+ and p5-575.)
Further Parallelisation of the SCF Process
Algorithm changes – alternatives to diagonalisation:
– The Hartree-Fock (HF) module may also be based on a quadratically convergent SCF (QCSCF) approach [G.B. Bacskay, Chem. Phys. 61 (1981) 385].
– The SCF equations are recast as a non-linear minimisation which bypasses the diagonalisation step. This scheme consists of only data-parallel operations and matrix multiplications, which guarantees high efficiency on parallel machines.
– Perhaps more significantly, QCSCF is amenable to several performance enhancements that are not possible in conventional approaches, e.g. orbital-Hessian vector products may be computed approximately, which significantly reduces the computational expense with no effect on the final accuracy.
I/O Considerations
● Most ab initio programs rely heavily on I/O
– integral files (for conventional SCF only)
– GAMESS-UK reduces its memory footprint by saving data to disk when not in use (ed3, ed7)
– data saved for restarting and later job steps (ed3)
● I/O in parallel implementations:
(i) All nodes maintain a copy of the file
– good for machines with fast local disk on nodes
– a limitation on machines with distributed file systems (e.g. HPCx)
(ii) Node 0 maintains a copy of the file on behalf of the parallel job (see the sketch below)
– keeps I/O to a minimum
– each read operation has to be followed by a broadcast
– this is the default for the MPI version of GAMESS-UK
(iii) The files can be mapped into memory
– particularly useful when each node maintains a partial copy (e.g. ed2)
– the aggregate memory of the parallel computer can then be exploited
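A minimal sketch of strategy (ii) (assumed code, not the GAMESS-UK I/O layer): rank 0 performs the physical read, and the record is then broadcast to the rest of the job.

  subroutine par_read(iunit, block, nwords, comm)
    use mpi
    integer, intent(in) :: iunit, nwords, comm
    double precision, intent(out) :: block(nwords)
    integer :: rank, ierr
    call MPI_Comm_rank(comm, rank, ierr)
    if (rank == 0) read(iunit) block   ! only rank 0 touches the disk
    call MPI_Bcast(block, nwords, MPI_DOUBLE_PRECISION, 0, comm, ierr)
  end subroutine par_read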
Characteristics of Integral-Parallel SCF
● Very efficient for small node counts
– the integral cost dominates
● Limited by Amdahl's law
– still a lot of serial code in the SCF that needs to be addressed (matrix multiply, diagonalisation, ...)
● The I/O and memory demands of the program have not changed
– considerable effort may be needed to make these efficient
● As an example, consider the performance of the GAMESS-UK MPI implementation
Parallelisation of GAMESS-UK
● Consider the performance of the MPI code.
● The impact of increasing molecule size on the balance between integral evaluation (parallelised) and the SCF steps (serial) in the MPI code. DFT calculations on:
– Morphine (6-31Gdp), 410 basis functions
– Cyclosporin (6-31G), 1000 basis functions, and
– Cyclosporin (6-31Gdp), 1855 basis functions
● All calculations performed on HPCx (Phase 2A, p5-575 nodes)
Morphine (6-31Gdp), 410 basis functions
Total time (secs) per computational step against processor count:

  Step      16 CPUs   32 CPUs   64 CPUs   128 CPUs
  2e-ints     150        75        38        21
  XC           43        23        13         8
  SCF          13        20        21        23
  Total       207       119        72        53

(A second chart showed the same data as the percentage contribution of each task (2e-ints, XC, SCF) against processor count.)
Cyclosporin (6-31G), 1000 basis functions
Total time (secs) per computational step against processor count:

  Step      16 CPUs   32 CPUs   64 CPUs   128 CPUs
  2e-ints     794       404       231       118
  XC          251       126        72        38
  SCF         146       163       175       188
  Total      1195       697       483       349

(A second chart showed the same data as the percentage contribution of each task (2e-ints, XC, SCF) against processor count.)
Cyclosporin (6-31Gdp), 1855 basis functions
Total time (secs) per computational step against processor count:

  Step      32 CPUs   64 CPUs   128 CPUs
  2e-ints    1988      1045       547
  XC          214       126        71
  SCF        1163      1211      1192
  Total      3377      2395      1825

(A second chart showed the same data as the percentage contribution of each task (2e-ints, XC, SCF) against processor count.)
Task Farming
● Trivial parallelism
– processors are divided up into groups
– each group works on an independent task
• requires scalability to smaller processor counts
● e.g. GAMESS-UK
– a task farm version has been implemented using MPI features (groups, communicators); see the sketch below
– used for combinatorial applications, e.g. genetic algorithm design of catalysts
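A minimal sketch of the communicator split (assumed code, not the GAMESS-UK harness; nprocs is assumed divisible by ngroups):

  program taskfarm_sketch
    use mpi
    implicit none
    integer :: ierr, world_rank, nprocs, ngroups, colour, group_comm
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    ngroups = 4                               ! illustrative choice
    colour = world_rank / (nprocs / ngroups)  ! contiguous blocks of ranks
    ! split the world into ngroups independent task-farm groups
    call MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, group_comm, ierr)
    ! ... each group now runs one independent calculation on group_comm ...
    call MPI_Finalize(ierr)
  end program taskfarm_sketch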
Task Farming Example (QDVE)
● QDVE on HPCx is a collaboration with Marcus C. Durrant, formerly John Innes Centre, now Northumbria University.
– Use a genetic algorithm to find the most effective transition metal complex for catalysing the reduction of N2 to N2H4.
– For each potential catalyst, the reaction energies for each step in the catalytic cycle are calculated and the most successful complexes go through to the next 'round'.
– Complexes are 'bred' and 'mutated' to create new complexes that will hopefully combine the most successful attributes of their parents.
– The process is repeated through a number of successive generations until a complex with the desired efficacy is found.
● Implementation:
– Reaction energies are calculated in 'task-farming mode', i.e. numerous small child jobs are run concurrently and in parallel on a subset of the total number of processors allocated to the parent job.
The Catalytic Complexes
(Diagram: the eight coordination geometries (1)-(8) of the complexes, each built from a metal M, a reaction substrate S and ligands L1 and L2.)
L1 = BH2-, CH3-, NH2-, OH-, AlH2-, SiH3-, PH2-, SH-, GaH2-, GeH3-, AsH2-, SeH-, NH3, OH2, PH3, SH2, AsH3, SeH2
L2 = H-, N3-, O2-, S2-, BH2-, CH3-, NH2-, OH-, F-, AlH2-, SiH3-, PH2-, SH-, Cl-, GaH2-, GeH3-, AsH2-, SeH-, Br-, NH3, OH2, PH3, SH2, AsH3, SeH2
M = transition metal
S = reaction substrate
Each complex consists of:
– a core structure – the metal M plus a set of ligands L1 and L2;
– a substrate ligand, representing the different species shown in the cycle.
The Catalytic Cycle
(Diagram: the catalytic cycle, steps (1)-(6), for the reduction of N2 to N2H4 at the metal centre M. Starting from M + N2, successive H+/e- additions take the complex through the intermediate species labelled (A)-(I), from M-N2 through M-NNH and M-NNH2 species, until N2H4 is released.)
The scoring functions for the reaction steps, where E(X) is the calculated energy of species X in the cycle:
D1 = E(B) - E(A) - E(N2) + 15
D2 = E(C) - E(B) - E(1/2 H2) - 10
D3 = E(C) - E(D) + 10
D4 = [E(E) or E(F)] - E(C) - E(1/2 H2) - 10
D5 = E(G) - [E(E) or E(F)] - E(1/2 H2) - 10
D6 = E(G) - E(H) + 10
D7 = E(I) - E(G) - E(1/2 H2) - 10
D8 = E(A) + E(N2H4) - E(I) + 10
(Diagram: an example nanogene, 21143584D2A, annotated field by field: the transition metal, identified by the row and column of the periodic table; the spin state; the charge on the complex; the geometry of the substrate and ligands around the metal; the primary ligand L1; the secondary ligand L2; the reaction species in the catalytic cycle; and a unique job identifier.)
Nanogenes
In molecular evolution, genes are transcribed into functional molecules (proteins); the survival of a gene is determined by the ability of its associated protein to carry out a target chemical reaction. In this project, each complex is described by a nanogene that uniquely identifies each aspect of the complex and also provides a way of automating the breeding and mutation process.
The Genetic Algorithm
1. Generate an initial population of transition metal complexes by random methods.
2. Use GAMESS-UK to calculate the energy of each step in the catalytic cycle.
3. For each complex, calculate an overall score (fitness) by comparing the calculated energies with a theoretical ideal value.
4. Use a set of selection rules to determine which complexes should go through to the next round.
5. Are the selection criteria fully satisfied? If yes, the QDVE process is completed.
6. If no, breed and mutate the survivors to create the next generation of catalytic complexes, and return to step 2.
● Execution on large-scale parallel resources requires submission of a small number of large jobs – hence a task farming harness is needed
– dynamically allocate jobs to processors
– support for automatic restart if required
● Batch processing of many automatically-generated model transition-metal-containing structures presents some problems
– conventional SCF convergence schemes are typically not robust enough to run without intervention (tuning of level shifters etc.)
– the driver was modified to choose automatically between convergence schemes based on diagnostics (energy changes, the occupied-virtual block of the Fock matrix, etc.)
QDVE - Implementation
(Diagram: the root process is connected to the group masters through a root/master communicator; each master and its slaves form an intra-group communicator, and each group runs a single GAMESS-UK job.)
The Replica Path Method
● The method involves the definition of a reaction path via replication of a set of macromolecular atoms.
● It entails the simultaneous optimisation of a series of geometries of the reacting system, corresponding to a series of points along the reaction pathway.
● The replica path approach has been tested on the chorismate/prephenate rearrangement, illustrating how the PMF approach, based on the constraint forces acting on the non-equilibrium path structures, can be used to extract a measure of the thermodynamics of the reaction from the active site atoms.
● The intermediates can be estimated by interpolation, as in the chorismate mutase reaction.
(Diagram: energy E against reaction coordinate, with replica points P0-P4 near one end of the path and P32-P36 near the other.)
The Replica Path Method
● Involves minimisation of the reaction path (the end points and e.g. 20 intermediates) at the same time
– the target function for the combined minimisation comprises the sum of the configurational energies, together with a series of penalty functions which ensure that the structures represent the reaction path (a generic form is sketched below)
● We can parallelise over images as well as deploy parallel processing within each image
● Effectively exploit 500-1000 processors even with a replicated data code
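As a hedged illustration (the precise penalty form used is given in the Woodcock et al. reference below), a target function of this kind can be written as

  E_path = Σ_i E(x_i) + Σ_i (k/2) (d_{i,i+1} - d̄)²

where x_i is the geometry of image i, d_{i,i+1} is a best-fit distance between adjacent images, d̄ is the mean image spacing and k is a penalty force constant.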
Replica Path Parallelisation
● The classical part of the system (both replicated and non-replicated MM regions) is computed using the standard CHARMM parallel code.
● For the QM calculation, however, the CHARMM communication subsystem is switched such that the processors are grouped into independent sets, each set working on one of the points on the pathway.
● The converged wavefunction for each point is maintained, ready to initialise the next calculation.
H.L. Woodcock, M. Hodoscek, P. Sherwood, Y.S. Lee, H.F. Schaefer III and B.R. Brooks, Theor. Chem. Acc. 109 (2003) 140-148.
Test System: Chorismate Mutase
● The chorismate mutase enzyme is well studied by both theoretical and experimental methods.
● The system was solvated, leading to a total of ca. 1500 atoms. Only one of the active sites in the trimeric enzyme was treated by the replica approach, the remainder by MM.
Chorismate Mutase Test System
● The chorismate/prephenate moiety is the only part treated by QM methods (thus avoiding any bonded QM/MM junctions).
● The replicated part of the system (6 Å cutoff) – with a different geometry at each point on the reaction path – is highlighted in the figure.
● The chorismate to prephenate rearrangement was found to have ΔH‡ and ΔH_rxn values of 14.9 and -19.5 kcal/mol. The activation enthalpy compares favourably with the experimental value of 12.7 ± 0.4 kcal/mol.
● Close agreement is found between the energy profiles obtained from direct energetic analysis and from the PMF integration approach.
Computed Energy Profiles
(Figure: the computed energy profiles along the reaction path.)
The QM/MM Modelling Approach
● Couple QM (quantum mechanics) and MM (molecular mechanics) approaches
● QM treatment of the active site
– reacting centre
– excited state processes (e.g. spectroscopy)
– problem structures (e.g. complex transition metal centre)
● Classical MM treatment of the environment
– enzyme structure
– zeolite framework
– explicit solvent molecules
– bulky organometallic ligands
Summary
● The GAMESS-UK package
– Parallelisation of the SCF process
• Static load balancing
• Dynamic load balancing
– Parallelising Linear Algebra
– Dealing with I/O requirements, mapping files to disk
– Observed performance of the MPI version
– Two-level Parallelism
• Task Farming in genetic algorithms
• Reaction Path methods