MSc in High Performance Computing – Computational Chemistry Module
Lecture 7: Parallel Approaches to Quantum Chemistry
(i) Replicated Data Parallelism
Huub van Dam and Paul Sherwood, STFC Daresbury Laboratory
Outline of the Lecture
The GAMESS-UK package
– Parallelisation of the SCF process
  – Expense of computational steps: integral generation and eigensolution
  • Static load balancing
  • Dynamic load balancing
– Parallelising Linear Algebra
– Dealing with I/O requirements, mapping files to disk
– Observed performance of the MPI version
– Two-level Parallelism
  • Task Farming in genetic algorithms – QDVE, generation of viable catalysts
  • Reaction Path methods – minimisation of the reaction path in the chorismate mutase enzyme
GAMESS-UK
● Generalised Atomic and Molecular Electronic Structure System
– Developed and maintained over 20 years as part of CSED's support for CCP1
– now over 1,200,000 lines of Fortran source
● Functionality
– HF, MCSCF, MP2, CI wavefunctions
– Density Functional Theory (including Hessians)
– excitation and ionisation energies (OVGF, 2ph-TDA, RPA...)
– a wide range of properties and analysis tools
– QM/MM implementations (CHARMM, ChemShell)
– Molec. Phys. 103 (2005) 719-747
● Active developments:
– excited state energies and forces from time-dependent DFT (jointly with PNNL)
– NMR chemical shifts
261 licensed groups; see http://www.cfs.dl.ac.uk
Scaling of Molecular Computations
The relative computing power required for molecular computations at four levels of theory. In the absence of screening techniques, the formal scaling for configuration interaction, Hartree-Fock, density functional and molecular dynamics methods is N^6, N^4, N^3 and N^2 respectively.
Parallelisation of GAMESS-UK
● 1985: HF-SCF module initially parallelised using message passing (TCGMSG, later PVM and MPI) for iPSC-class machines
● 1990s: use of global memory through the Global Array Tools (PNNL)
– data objects can be distributed and accessed without synchronisation, e.g. in-core storage of integrals
– enables a wider range of algorithms to be tackled, e.g. parallel MP2, mapping I/O to memory access
– the core of the program still uses a replicated data strategy, so node memory limits the maximum system size
(Diagram: the Global Arrays present a single, shared data structure over physically distributed data.)
● 2003: implementation of a partial distributed-data strategy
– new F90 matrix module, mapping most of the matrices to global memory; can be re-used for other codes, e.g. CRYSTAL
– standards-based (MPI, ScaLAPACK)
Self Consistent Field Method (SCF)
● SCF theory: each electron interacts with a mean potential created by the other (N-1) electrons.
● The SCF method is derived by assuming a specific form of the solution to the quantum mechanical equation – the Schrödinger equation (HF theory) or the Kohn-Sham equation (DFT) – leading to a set of coupled integro-differential equations that could, in principle, be solved numerically.
● It is more common to expand the solutions in a finite set of primitive functions (the basis set). The equations then become a set of coupled homogeneous equations, usually written in matrix form.
● The eigenvalues and eigenvectors of the matrix describing the particle interactions are required; because of the coupling in the matrix (the matrix is defined in terms of its own solutions) a self-consistent solution is sought.
● This implies an iterative process, iterating until the Fock matrix remains constant from iteration to iteration, as sketched below.
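A minimal pseudocode sketch of this cycle (the helper names guess_density, two_electron_terms, diagonalise and density_from_orbitals are illustrative, not GAMESS-UK routines):

  P = guess_density()
  do iter = 1, maxiter
     F = H0 + two_electron_terms(P)     ! integral generation and Fock build
     call diagonalise(F, C, eps)        ! eigensolution for MO coefficients
     Pnew = density_from_orbitals(C)    ! occupied MOs -> new density (dgemm)
     if (converged(P, Pnew)) exit
     P = Pnew
  enddo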
Schematic SCF
(Diagram: MOAO represents the molecular orbitals, P the density matrix and F the Fock or Hamiltonian matrix. Starting from guess orbitals, P is formed from MOAO via dgemm; the integrals contribute V1e, VCoul and VXC terms to F; a sequential eigensolver yields new MOAO; the loop repeats until converged.)
Computationally most expensive steps:
– Integral generation: O(N^4) to O(N^2.5)
– Exchange-correlation quadrature (DFT only): O(N)
– Fock matrix construction from integrals: O(N^4) to O(N^2.5)
– Diagonalisation: O(N^3); orthogonalisation: O(N^3)
– Matrix multiply: O(N^3)
F_μν = H⁰_μν + Σ_λσ P_λσ [ (μν|λσ) − ½ (μλ|νσ) ]
2-Electron Integral Generation Loop
Integrals are computed in a 4-nested loop over basis function shells
– basis functions within a shell share exponents

  do ish = 1, nsh
    do jsh = 1, ish
      do ksh = 1, ish
        do lsh = 1, ksh
          Compute batch of integrals
          Store (conventional SCF)
            or
          Multiply by P to get F (direct SCF)
        enddo
      enddo
    enddo
  enddo
F_μν = H⁰_μν + Σ_λσ P_λσ [ (μν|λσ) − ½ (μλ|νσ) ]
Parallelisation with static task allocation
Use the node number to assign tasks:

  do ish = 1, nshells
    do jsh = 1, ish
      do ksh = 1, ish
        do lsh = 1, ksh
          if (mod(lsh, nnodes) .eq. mynode) then
            Compute batch of integrals
            Multiply integrals by P to get F
          endif
        enddo
      enddo
    enddo
  enddo
  call global_sum(F)
Alternatively, test one loop further out so each task is a complete lsh loop:

  do ish = 1, nshells
    do jsh = 1, ish
      do ksh = 1, ish
        if (mod(ksh, nnodes) .eq. mynode) then
          do lsh = 1, ksh
            Compute batch of integrals
            Multiply integrals by P to get F
          enddo
        endif
      enddo
    enddo
  enddo
  call global_sum(F)
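The placement of the mod test sets the task granularity: testing on lsh parcels out individual integral batches (finest grain, best balance, most bookkeeping), while testing on ksh hands each node a complete inner lsh loop (coarser grain, less overhead, but larger variation in task size).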
Parallel SCF considerations
● In conventional SCF we are creating one integral file on each node. Some logic is needed to support this (GAMESS-UK uses special file names such as ed2000, ed2001, ... etc.)
● Not all tasks are the same size:
– if ish, jsh, ksh, lsh are indices of shells comprising a single function, there is only a single integral
– if ish, jsh, ksh, lsh are indices of shells containing Px, Py, Pz, there are potentially 3x3x3x3 integrals (some equivalent by symmetry)
– even more for d, f, g orbitals
● All nodes must wait for the global sum at the end:
– the "slowest" node controls the speed of execution
– dynamic allocation of tasks can reduce the load imbalance
Dynamically Load Balanced SCF

  itask = next_task()
  icount = 0
  do ish = 1, nsh
    do jsh = 1, ish
      do ksh = 1, ish
        do lsh = 1, ksh
          if (icount .eq. itask) then
            Compute batch of integrals
            Multiply integrals by P to get F
            itask = next_task()
          endif
          icount = icount + 1
        enddo
      enddo
    enddo
  enddo
  call global_sum(F)
Implementing dynamic load balancing
● Some toolkits provide a "global counter"
– e.g. GAMESS-UK was first parallelised using TCGMSG, which provides the NXTVAL() call
• the implementation is quite machine dependent
● When using MPI-1, an additional task can be assigned to hold the counter and reply to incoming requests (see the sketch below)
– quite wasteful for small node counts
– the basis of the GAMESS-UK dynamic load balanced MPI version
● Dynamic allocation works better if large tasks are encountered before small ones
– more efficient to reverse the loop orderings in integral generation
• generally better to use a g, f, d, p, s ordering of shells rather than s, p, d, f, g
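A minimal sketch of the MPI-1 counter-task idea (assumed code, not the GAMESS-UK source): one dedicated rank sits in a receive loop serving task numbers, and next_task() is a send/receive pair against that rank.

  ! Counter server: runs on one dedicated rank, handing out task numbers.
  subroutine counter_server(comm)
    use mpi
    integer, intent(in) :: comm
    integer :: counter, buf, status(MPI_STATUS_SIZE), ierr
    counter = 0
    do
      ! tag 1 = task request; tag 2 = shut down (sent once after the global sum)
      call MPI_Recv(buf, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, status, ierr)
      if (status(MPI_TAG) == 2) exit
      call MPI_Send(counter, 1, MPI_INTEGER, status(MPI_SOURCE), 1, comm, ierr)
      counter = counter + 1
    enddo
  end subroutine counter_server

  ! Workers fetch the next global task number from the server rank.
  integer function next_task(comm, server)
    use mpi
    integer, intent(in) :: comm, server
    integer :: dummy, status(MPI_STATUS_SIZE), ierr
    dummy = 0
    call MPI_Send(dummy, 1, MPI_INTEGER, server, 1, comm, ierr)
    call MPI_Recv(next_task, 1, MPI_INTEGER, server, 1, comm, status, ierr)
  end function next_task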
Further Parallelisation of the SCF Process
Symmetric eigensolver routines – parallel diagonalisation:
– PeIGS: G.I. Fann and R.J. Littlefield, "Parallel inverse iteration with reorthogonalization", in Proceedings of the Conference on Parallel Processing for Scientific Computing, SIAM, pp. 409-413.
– Solves dense real symmetric standard (Ax = λx) and generalised (Ax = λBx) eigenproblems.
– The numerical method used is multisection for the eigenvalues, and repeated inverse iteration with orthogonalisation for the eigenvectors.
– Guarantees orthogonality of the eigenvectors, even for arbitrarily large eigenvalue clusters that span processors.
– ScaLAPACK: PDSYEVX, PDSYEVD, PDSYEV – a variety of routines and algorithms.
Parallel Linear Algebra
● ScaLAPACK – drivers for solving standard and generalised dense symmetric or dense Hermitian eigenproblems (an illustrative PDSYEVD call is sketched below):
– PDSYEV (QR method) (ScaLAPACK 1.5)
– PDSYEVX (bisection and inverse iteration) (ScaLAPACK 1.5)
– PDSYEVD (divide-and-conquer method) (ScaLAPACK 1.7)
● BFG (I. Bush) – block Jacobi method on the full dense symmetric matrix (+ Hermitian)
● PLAPACK
– QR method
– MRRR 'Multiple Relatively Robust Representations'
Parallel Diagonalisation – Scalability of Algorithms
(Chart: real symmetric eigenvalue problem, Fock matrix, N = 1152; time in seconds (0-2) against number of processors (8, 16, 32, 64) for PeIGS 2.1 (PDSPEV), ScaLAPACK (PDSYEV), ScaLAPACK (PDSYEVD) and BFG.)
Parallel Eigensolvers
(Chart: PDSYEVD performance for a Fock matrix (CRYSTAL), N_basis = 3888; total time in seconds against number of processors (16-512) on IBM p690, p690+ and p5-575.)
Further Parallelisation of the SCF Process
Algorithm changes – alternatives to diagonalisation:
– The Hartree-Fock (HF) module may also be based on a quadratically convergent SCF (QCSCF) approach [G.B. Bacskay, Chem. Phys. 61 (1981) 385].
– The SCF equations are recast as a non-linear minimisation which bypasses the diagonalisation step. This scheme consists of only data-parallel operations and matrix multiplications, which guarantees high efficiency on parallel machines.
– Perhaps more significantly, QCSCF is amenable to several performance enhancements that are not possible in conventional approaches, e.g. orbital-Hessian vector products may be computed approximately, which significantly reduces the computational expense with no effect on the final accuracy.
I/O Considerations
● Most ab initio programs rely heavily on I/O
– integral files (for conventional SCF only)
– GAMESS-UK reduces its memory footprint by saving data to disk when not in use (ed3, ed7)
– data saved for restarting and later job steps (ed3)
● I/O in parallel implementations:
(i) All nodes maintain a copy of the file
– good for machines with fast local disk on nodes
– a limitation on machines with distributed file systems (e.g. HPCx)
(ii) Node 0 maintains a copy of the file on behalf of the parallel job (see the sketch below)
– keeps I/O to a minimum
– each read operation has to be followed by a broadcast
– this is the default for the MPI version of GAMESS-UK
(iii) The files can be mapped into memory
– particularly useful when each node maintains a partial copy (e.g. ed2)
– the aggregate memory of the parallel computer can then be exploited
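A minimal sketch of strategy (ii) (assumed code, not the GAMESS-UK I/O layer): rank 0 performs the physical read, and the record is then broadcast to the rest of the job.

  subroutine par_read(iunit, block, nwords, comm)
    use mpi
    integer, intent(in) :: iunit, nwords, comm
    double precision, intent(out) :: block(nwords)
    integer :: rank, ierr
    call MPI_Comm_rank(comm, rank, ierr)
    if (rank == 0) read(iunit) block   ! only rank 0 touches the disk
    call MPI_Bcast(block, nwords, MPI_DOUBLE_PRECISION, 0, comm, ierr)
  end subroutine par_read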
Characteristics of Integral-Parallel SCF
● Very efficient for small node counts
– the integral cost dominates
● Limited by Amdahl's law
– still a lot of serial code in the SCF that needs to be addressed (matrix multiply, diagonalisation, ...)
● The I/O and memory demands of the program have not changed
– considerable effort may be needed to make these efficient
● As an example, consider the performance of the GAMESS-UK MPI implementation
Parallelisation of GAMESS-UK
● Consider the performance of the MPI code.
● The impact of increasing molecule size on the balance between integral evaluation (parallelised) and the SCF steps (serial) in the MPI code. DFT calculations on:
– Morphine (6-31Gdp), 410 basis functions
– Cyclosporin (6-31G), 1000 basis functions, and
– Cyclosporin (6-31Gdp), 1855 basis functions
● All calculations performed on HPCx (Phase 2A, p5-575 nodes)
Morphine (6-31Gdp), 410 basis functions
Total time (secs) per computational step against processor count:

  Step      16 CPUs   32 CPUs   64 CPUs   128 CPUs
  2e-ints     150        75        38        21
  XC           43        23        13         8
  SCF          13        20        21        23
  Total       207       119        72        53

(A second chart showed the same data as the percentage contribution of each task (2e-ints, XC, SCF) against processor count.)
Cyclosporin (6-31G), 1000 basis functions
Total time (secs) per computational step against processor count:

  Step      16 CPUs   32 CPUs   64 CPUs   128 CPUs
  2e-ints     794       404       231       118
  XC          251       126        72        38
  SCF         146       163       175       188
  Total      1195       697       483       349

(A second chart showed the same data as the percentage contribution of each task (2e-ints, XC, SCF) against processor count.)
Cyclosporin (6-31Gdp), 1855 basis functions
Total time (secs) per computational step against processor count:

  Step      32 CPUs   64 CPUs   128 CPUs
  2e-ints    1988      1045       547
  XC          214       126        71
  SCF        1163      1211      1192
  Total      3377      2395      1825

(A second chart showed the same data as the percentage contribution of each task (2e-ints, XC, SCF) against processor count.)
Task Farming
● Trivial parallelism
– processors are divided up into groups
– each group works on an independent task
• requires scalability to smaller processor counts
● e.g. GAMESS-UK
– a task farm version has been implemented using MPI features (groups, communicators); see the sketch below
– used for combinatorial applications, e.g. genetic algorithm design of catalysts
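A minimal sketch of the communicator split (assumed code, not the GAMESS-UK harness; nprocs is assumed divisible by ngroups):

  program taskfarm_sketch
    use mpi
    implicit none
    integer :: ierr, world_rank, nprocs, ngroups, colour, group_comm
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    ngroups = 4                               ! illustrative choice
    colour = world_rank / (nprocs / ngroups)  ! contiguous blocks of ranks
    ! split the world into ngroups independent task-farm groups
    call MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, group_comm, ierr)
    ! ... each group now runs one independent calculation on group_comm ...
    call MPI_Finalize(ierr)
  end program taskfarm_sketch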
Task Farming Example (QDVE)
● QDVE on HPCx is a collaboration with Marcus C. Durrant, formerly John Innes Centre, now Northumbria University.
– Use a genetic algorithm to find the most effective transition metal complex for catalysing the reduction of N2 to N2H4.
– For each potential catalyst, the reaction energies for each step in the catalytic cycle are calculated and the most successful complexes go through to the next 'round'.
– Complexes are 'bred' and 'mutated' to create new complexes that will hopefully combine the most successful attributes of their parents.
– The process is repeated through a number of successive generations until a complex with the desired efficacy is found.
● Implementation:
– Reaction energies are calculated in 'task-farming mode', i.e. numerous small child jobs are run concurrently and in parallel on a subset of the total number of processors allocated to the parent job.
The Catalytic Complexes
(Diagram: the eight coordination geometries (1)-(8) of the complexes, each built from a metal M, a reaction substrate S and ligands L1 and L2.)
L1 = BH2-, CH3-, NH2-, OH-, AlH2-, SiH3-, PH2-, SH-, GaH2-, GeH3-, AsH2-, SeH-, NH3, OH2, PH3, SH2, AsH3, SeH2
L2 = H-, N3-, O2-, S2-, BH2-, CH3-, NH2-, OH-, F-, AlH2-, SiH3-, PH2-, SH-, Cl-, GaH2-, GeH3-, AsH2-, SeH-, Br-, NH3, OH2, PH3, SH2, AsH3, SeH2
M = transition metal
S = reaction substrate
Each complex consists of:
– a core structure – the metal M plus a set of ligands L1 and L2;
– a substrate ligand, representing the different species shown in the cycle.
The Catalytic Cycle
(Diagram: the catalytic cycle, steps (1)-(6), for the reduction of N2 to N2H4 at the metal centre M. Starting from M + N2, successive H+/e- additions take the complex through the intermediate species labelled (A)-(I), from M-N2 through M-NNH and M-NNH2 species, until N2H4 is released.)
The scoring functions for the reaction steps, where E(X) is the calculated energy of species X in the cycle:
D1 = E(B) - E(A) - E(N2) + 15
D2 = E(C) - E(B) - E(1/2 H2) - 10
D3 = E(C) - E(D) + 10
D4 = [E(E) or E(F)] - E(C) - E(1/2 H2) - 10
D5 = E(G) - [E(E) or E(F)] - E(1/2 H2) - 10
D6 = E(G) - E(H) + 10
D7 = E(I) - E(G) - E(1/2 H2) - 10
D8 = E(A) + E(N2H4) - E(I) + 10
(Diagram: an example nanogene, 21143584D2A, annotated field by field: the transition metal, identified by the row and column of the periodic table; the spin state; the charge on the complex; the geometry of the substrate and ligands around the metal; the primary ligand L1; the secondary ligand L2; the reaction species in the catalytic cycle; and a unique job identifier.)
Nanogenes
In molecular evolution, genes are transcribed into functional molecules (proteins); the survival of a gene is determined by the ability of its associated protein to carry out a target chemical reaction. In this project, each complex is described by a nanogene that uniquely identifies each aspect of the complex and also provides a way of automating the breeding and mutation process.
The Genetic Algorithm
1. Generate an initial population of transition metal complexes by random methods.
2. Use GAMESS-UK to calculate the energy of each step in the catalytic cycle.
3. For each complex, calculate an overall score (fitness) by comparing the calculated energies with a theoretical ideal value.
4. Use a set of selection rules to determine which complexes should go through to the next round.
5. Are the selection criteria fully satisfied? If yes, the QDVE process is completed.
6. If no, breed and mutate the survivors to create the next generation of catalytic complexes, and return to step 2.
● Execution on large-scale parallel resources requires submission of a small number of large jobs – hence a task farming harness is needed
– dynamically allocate jobs to processors
– support for automatic restart if required
● Batch processing of many automatically-generated model transition-metal-containing structures presents some problems
– conventional SCF convergence schemes are typically not robust enough to run without intervention (tuning of level shifters etc.)
– the driver was modified to choose automatically between convergence schemes based on diagnostics (energy changes, the occupied-virtual block of the Fock matrix, etc.)
QDVE - Implementation
(Diagram: the root process is connected to the group masters through a root/master communicator; each master and its slaves form an intra-group communicator, and each group runs a single GAMESS-UK job.)
The Replica Path Method
● The method involves the definition of a reaction path via replication of a set of macromolecular atoms.
● It entails the simultaneous optimisation of a series of geometries of the reacting system, corresponding to a series of points along the reaction pathway.
● The replica path approach has been tested on the chorismate/prephenate rearrangement, illustrating how the PMF approach, based on the constraint forces acting on the non-equilibrium path structures, can be used to extract a measure of the thermodynamics of the reaction from the active site atoms.
● The intermediates can be estimated by interpolation, as in the chorismate mutase reaction.
(Diagram: energy E against reaction coordinate, with replica points P0-P4 near one end of the path and P32-P36 near the other.)
The Replica Path Method
● Involves minimisation of the reaction path (the end points and e.g. 20 intermediates) at the same time
– the target function for the combined minimisation comprises the sum of the configurational energies, together with a series of penalty functions which ensure that the structures represent the reaction path (a generic form is sketched below)
● We can parallelise over images as well as deploy parallel processing within each image
● Effectively exploit 500-1000 processors even with a replicated data code
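As a hedged illustration (the precise penalty form used is given in the Woodcock et al. reference below), a target function of this kind can be written as

  E_path = Σ_i E(x_i) + Σ_i (k/2) (d_{i,i+1} - d̄)²

where x_i is the geometry of image i, d_{i,i+1} is a best-fit distance between adjacent images, d̄ is the mean image spacing and k is a penalty force constant.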
Replica Path Parallelisation
● The classical part of the system (both replicated and non-replicated MM regions) is computed using the standard CHARMM parallel code.
● For the QM calculation, however, the CHARMM communication subsystem is switched such that the processors are grouped into independent sets, each set working on one of the points on the pathway.
● The converged wavefunction for each point is maintained, ready to initialise the next calculation.
H.L. Woodcock, M. Hodoscek, P. Sherwood, Y.S. Lee, H.F. Schaefer III and B.R. Brooks, Theor. Chem. Acc. 109 (2003) 140-148.
Test System: Chorismate Mutase
● The chorismate mutase enzyme is well studied by both theoretical and experimental methods.
● The system was solvated, leading to a total of ca. 1500 atoms. Only one of the active sites in the trimeric enzyme was treated by the replica approach, the remainder by MM.
Chorismate Mutase Test System
● The chorismate/prephenate moiety is the only part treated by QM methods (thus avoiding any bonded QM/MM junctions).
● The replicated part of the system (6 Å cutoff) – with a different geometry at each point on the reaction path – is highlighted in the figure.
● The chorismate to prephenate rearrangement was found to have ΔH‡ and ΔH_rxn values of 14.9 and -19.5 kcal/mol. The activation enthalpy compares favourably with the experimental value of 12.7 ± 0.4 kcal/mol.
● Close agreement is found between the energy profiles obtained from direct energetic analysis and from the PMF integration approach.
Computed Energy Profiles
(Figure: the computed energy profiles along the reaction path.)
The QM/MM Modelling Approach
● Couple QM (quantum mechanics) and MM (molecular mechanics) approaches
● QM treatment of the active site
– reacting centre
– excited state processes (e.g. spectroscopy)
– problem structures (e.g. complex transition metal centre)
● Classical MM treatment of the environment
– enzyme structure
– zeolite framework
– explicit solvent molecules
– bulky organometallic ligands
Summary
● The GAMESS-UK package
– Parallelisation of the SCF process
• Static load balancing
• Dynamic load balancing
– Parallelising Linear Algebra
– Dealing with I/O requirements, mapping files to disk
– Observed performance of the MPI version
– Two-level Parallelism
• Task Farming in genetic algorithms
• Reaction Path methods