Evolution of Computational Approaches in NWChem
Eric J. Bylaska, W. De Jong, K. Kowalski, N. Govind, M. Valiev, and D. Wang
High Performance Software Development, Molecular Science Computing Facility
Outline
Overview of the capabilities of NWChem
Existing terascale/petascale simulations in NWChem
Challenges with developing massively parallel algorithms (that work)
Using embeddable languages (Python) to develop multifaceted simulations
Summary
Overview of NWChem: Background
NWChem is part of the Molecular Science Software Suite
Developed as part of the construction of EMSL
Environmental Molecular Sciences Laboratory, a DOE BER user facility, located at PNNL
Designed and developed to be a highly efficient and portable Massively Parallel computational chemistry package
Provides computational chemistry solutions that are scalable with respect to chemical system size as well as MPP hardware size
Overview of NWChem: Background
More than 2,000,000 lines of code (mostly FORTRAN)
About half of it computer generated (Tensor Contraction Engine)
A new version is released on a yearly basis: addition of new modules and bug fixes; porting of the software to new hardware; increased performance (serial & parallel)
Freely available after signing a user agreement
World-wide distribution (downloaded by 1900+ groups)
70% academia, the rest government labs and industry
Originally designed for parallel architectures
Scales to 1000's of processors (parts even to 10,000's)
NWChem performs well on small compute clusters
Portable – runs on a wide range of computers
Supercomputer to Mac or PC with Windows, including Cray XT4 and IBM BlueGene
Various operating systems, interconnects, CPUs
Emphasis on modularity and portability
Overview of NWChem: Available on many compute platforms
Overview of NWChem: capabilities
NWChem brings a full suite of methodologies to solve large scientific problems
High accuracy methods (MP2/CC/CI)
Gaussian based density functional theory
Plane wave based density functional theory
Molecular dynamics (including QMD)
Features that are standard in most computational chemistry codes are not listed here
Brochure with detailed listing available at http://www.emsl.pnl.gov/docs/nwchem/nwchem.html
Overview of NWChem: high accuracy methods
Coupled Cluster
Closed-shell coupled cluster [CCSD and CCSD(T)]
Tensor Contraction Engine (TCE):
Spin-orbital formalism with RHF, ROHF, UHF reference
CCD, CCSDTQ, CCSD(T)/[T]/t/…, LCCD, LCCSD
CR-CCSD(T), LR-CCSD(T)
Multireference CC (available soon)
EOM-CCSDTQ for excited states
MBPT(2-4), CISDTQ, QCISD

CF3CHFO CCSD(T) with TCE: 606 bf, 57 electrons, open-shell
8-water cluster on EMSL computer: 1376 bf, 32 elec, 1840 processors; achieves 63% of 11 TFlops
[Results: CR-EOMCCSD(T) excitation energies, gas phase vs. DNA environment: 4.76 and 5.24 eV; 5.01 and 5.79 eV.]
[Results: surface exciton energy 6.35 ± 0.10 eV (expt.); TDDFT (∞) 5.81 eV; EOMCCSD (∞) 6.38 eV.]
Dipole polarizabilities (in Å3) of the C60 molecule in the ZM3POL basis set: 1080 basis functions; 240 correlated electrons; 1.5 billion T2 amplitudes.

Wavelength (nm)   CC2      CCSD     Expt.
—                 92.33    82.20    76.5±8
1064              94.77    83.62    79±4
532               88.62    —        —
Overview of NWChem: Latest capabilities of high accuracy
NWChem capabilities: DFT
Density functional theory
Wide range of local and non-local exchange-correlation functionals: Truhlar's M05, M06 (Yan Zhao), Hartree-Fock exchange, meta-GGA functionals
Self-interaction correction (SIC) and OEP
Spin-orbit DFT: ECP, ZORA, DK
Constrained DFT (implemented by Qin Wu and Van Voorhis)
DFT performance: Si75O148H66, 3554 bf, 2300 elec, Coulomb fitting
Overview of NWChem: TDDFT
Density functional theory
TDDFT for excited states
Modeling the λmax (absorption maximum) of optical chromophores with dicyanovinyl and 3-phenylisoxazolone groups with TDDFT, B3LYP, COSMO (ε = 8.93, CH2Cl2), 6-31G**; Andzelm (ARL) et al. (in preparation, 2008)
Expt: 539 nm; TDDFT: 572 nm
Expt: 696 nm; TDDFT: 730 nm
Development & validation of new Coulomb attenuated (LC) functionals(Andzelm, Baer, Govind, in preparation 2008)
Overview of NWChem: plane wave DFT
Plane wave density functional theory
Gamma point pseudopotential and projector augmented wave
Band structure (w/ spin orbit)
Extensive dynamics functionality with Car-Parrinello
CPMD/MM molecular dynamics, e.g. SPC/E, CLAYFF, solid state MD
Spin-Orbit splitting in GaAs
Silica/Water CPMD Simulation
CCl42-/water CPMD/MM Simulation
Various exchange-correlation functionals
LDA, PBE96, PBE0, (B3LYP)
Exact exchange
SIC and OEP
Can handle charged systems
A full range of pseudopotentials and a pseudopotential generator
A choice of state-of-the-art minimizers
Overview of NWChem: classical molecular dynamics
Molecular dynamics
CHARMM and AMBER force fields
Various types of simulations:
Energy minimization
Molecular dynamics simulation, including ab initio dynamics
Free energy calculation
Multiconfiguration thermodynamic integration
Electron transfer through proton hopping (Q-HOP), i.e. semi-QM in classical MD
Implemented by the Volkhard Helms group, Saarland University, Germany
Set up and analyze runs with Ecce
Overview of NWChem: QM/MM
Seamless integration of molecular dynamics with coupled cluster and DFT
Optimization and transition state search
QM/MM potential of mean force (PMF)
Modeling properties at finite temperature
Excited states with EOM-CC
Polarizabilities with linear response CC
NMR chemical shift with DFT
QM/MM CR-EOM-CCSD excited state calculations of cytosine base in DNA, Valiev et al., JCP 125 (2006)
Overview of NWChem: Miscellaneous functionality
Other functionality available in NWChem:
Electron transfer
Vibrational SCF and DFT for anharmonicity
COSMO
ONIOM
Relativity through spin-orbit ECP, ZORA, and DK
NMR shielding and indirect spin-spin coupling
Interface with VENUS for chemical reaction dynamics
Interface with POLYRATE, Python
Existing Terascale/Petascale capabilities
NWChem has several modules which are scaling to the terascale and work is ongoing to approach the petascale
[Figure: speedup vs. number of nodes (up to 1000) on a Cray T3E/900.
CCSD scalability: C60, 1080 basis set functions.
MD scalability: octanol (216,000 atoms).
NWPW scalability: UO2(2+) + 122 H2O (Ne = 500, Ng = 96x96x96).]
Existing terascale/petascale capabilities: Embarrassingly Parallel Algorithms
Some types of problems can be decomposed and executed in parallel with virtually no need for tasks to share data. These types of problems are called embarrassingly parallel because they are so straightforward: very little (or no) inter-task communication is required.
Possible to use embeddable languages such as Python to implement
Examples: computing potential energy surfaces; potential of mean force (PMF); Monte Carlo; QM/MM free energy perturbation; dynamical nucleation theory; ONIOM; …
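For instance, a potential energy surface scan is just a set of independent single-point calculations. A minimal sketch of farming those out with Python's standard multiprocessing module; the function energy_at and the toy potential are placeholders, not NWChem calls:

# Minimal embarrassingly parallel scan: every grid point is an independent
# task, so no inter-task communication is needed (placeholder code, not
# NWChem's Python interface).
from multiprocessing import Pool

def energy_at(bond_length):
    # stand-in for an independent single-point energy calculation
    return (bond_length - 1.1)**2

if __name__ == '__main__':
    grid = [0.8 + 0.05*i for i in range(21)]   # bond lengths to scan
    pool = Pool(processes=4)                   # one worker per core
    surface = pool.map(energy_at, grid)        # farm out the independent points
    pool.close()
    pool.join()
    for r, e in zip(grid, surface):
        print('%5.2f  %10.6f' % (r, e))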
Challenges: Tackling Amdahl's Law
If a fraction f of a program can be run in parallel and the remaining fraction 1 − f is serial, the speedup on N processors is

S(N) = T(1) / T(N) = 1 / ((1 − f) + f/N)

so even as N → ∞ the speedup is bounded by 1/(1 − f).

[Diagram: "My Program" divided into a serial part (1 − f) and a parallel part (f); parallel efficiency.]

Residual run time of a 24-hour job after parallelizing a fraction f over N processors:
f = 99%,    N = 100:     24 h × 1%    ≈ 14 min
f = 99.9%,  N = 1,000:   24 h × 0.1%  ≈ 1.4 min
f = 99.99%, N = 10,000:  24 h × 0.01% ≈ 8.6 s
There is a limit to the performance gain we can expect from parallelization
Significant time investment for each 10-fold increase in parallelism!
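The bound is easy to tabulate with a few lines of Python (a throwaway sketch, not taken from NWChem; the function name amdahl_speedup is just for illustration):

# Amdahl's law: speedup of a program whose fraction f is parallelizable,
# run on N processors.
def amdahl_speedup(f, N):
    return 1.0 / ((1.0 - f) + f / N)

for f in (0.99, 0.999, 0.9999):
    for N in (100, 1000, 10000):
        print('f = %-7g N = %-6d speedup = %8.1f' % (f, N, amdahl_speedup(f, N)))
# Even with f = 0.99 the speedup can never exceed 1/(1-f) = 100,
# no matter how many processors are used.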
Challenges: Gustafson’s Law – make the problem big
Let a(n) be the serial work and b(n) the parallelizable work for a problem of size n. On P processors the parallel run takes

T_p = a(n) + b(n)

whereas the same amount of work on a single processor would take

T_s = a(n) + P·b(n)

so the scaled speedup is

S = T_s / T_p = (a(n) + P·b(n)) / (a(n) + b(n))

[Diagram: "My parallel program" divided into a serial part a(n) and a parallel part b(n).]

Assuming a(n) stays small as n grows and normalizing a(n) + b(n) = 1, then

S = a(n) + P·(1 − a(n)) = P − (P − 1)·a(n) ≈ P
Gustafson's Law (also known as Gustafson-Barsis' law) states that any sufficiently large problem can be efficiently parallelized
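A small Python sketch of the scaled speedup (illustrative only, not part of NWChem; the function name gustafson_speedup and the 1% serial fraction are arbitrary choices):

# Gustafson's law: scaled speedup when the problem grows with the machine.
# a is the serial fraction of the parallel run (a + b = 1 after normalization).
def gustafson_speedup(a, P):
    return P - (P - 1) * a

for P in (100, 1000, 10000):
    print('P = %-6d scaled speedup = %8.1f' % (P, gustafson_speedup(0.01, P)))
# Unlike Amdahl's fixed-size bound of 1/(1-f) = 100, the scaled speedup keeps
# growing with P for the same 1% serial fraction, because the work grows too.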
Challenges: Extending Time – Failure of Gustafson’s Law?
The need for up-scaling in time is especially critical for classical molecular dynamics and ab initio molecular dynamics simulations, where up-scaling by about a factor of 1000 will be needed.
Ab initio molecular dynamics simulations run over several months currently reach only 10 to 100 picoseconds, while the processes of interest are on the order of nanoseconds.
The step length in ab initio molecular dynamics simulation is on the order of 0.1…0.2 fs/step
1 ns of simulation time ≈ 10,000,000 steps:
at 1 second per step ≈ 115 days of computing time
at 10 seconds per step ≈ 3 years
at 30 seconds per step ≈ 9 years
Classical molecular dynamics simulations run over several months are only able to simulate between 10 and 100 nanoseconds, but many of the physical processes of interest are on the order of at least a microsecond.
The step length in molecular dynamics simulation is on the order of 1…2 fs/step
1 us of simulation time ≈ 1,000,000,000 steps:
at 0.01 seconds per step ≈ 115 days of computing time
at 0.1 seconds per step ≈ 3 years
at 1 second per step ≈ 30 years
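These back-of-the-envelope estimates all come from the same product of steps times cost per step; a small hypothetical helper (not NWChem code) makes that explicit:

# Wall-clock time needed to reach a target simulated time, given the MD
# timestep and the cost of one step (hypothetical helper, not NWChem code).
def wall_clock_days(target_fs, timestep_fs, seconds_per_step):
    steps = target_fs / timestep_fs
    return steps * seconds_per_step / 86400.0

# 1 ns of ab initio MD at 0.1 fs/step and 1 s/step:
print('%.0f days' % wall_clock_days(1.0e6, 0.1, 1.0))              # ~116 days
# 1 us of classical MD at 1 fs/step and 1 s/step:
print('%.1f years' % (wall_clock_days(1.0e9, 1.0, 1.0) / 365.0))   # ~32 years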
Challenges: Need For Communication Tolerant Algorithms
Data motion costs increasing due to “processor memory gap”
Coping strategies
Trade computation for communication [SC03]
Overlap Computation and Communication
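A schematic of the overlap idea, written with mpi4py purely for illustration (NWChem itself relies on Global Arrays/ARMCI for its communication): post non-blocking messages for the next block of data, do local work on the block already in hand, and only wait when the remote data is actually needed.

# Schematic overlap of computation and communication with mpi4py
# (illustration only; not how NWChem is implemented internally).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

block = np.full(1000, float(rank))      # data owned by this process
incoming = np.empty_like(block)

# Post the communication for the neighboring block first ...
recv_req = comm.Irecv(incoming, source=left)
send_req = comm.Isend(block, dest=right)

# ... then do local work while the messages are in flight.
local_result = np.sum(block * block)

# Only wait when the remote data is actually needed.
MPI.Request.Waitall([recv_req, send_req])
total = comm.allreduce(local_result + np.sum(incoming), op=MPI.SUM)
if rank == 0:
    print('total = %g' % total)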
[Figure: speedup for the Car-Parrinello code vs. number of nodes (0-70) on the Cray T3E and IBM SP.]
[Figure: processor-memory performance gap, 1980-2005 (log scale): processor performance grows ~55%/year while DRAM memory performance grows ~7%/year.]
Scaling of local basis DFT calculation
[Figure: time (s) vs. number of processors, 8 to 128.]
Basic Parallel Computing: tasks
Break program into tasks
A task is a coherent piece of work (such as reading data from the keyboard, or some part of a computation) that is to be done sequentially; it cannot be further divided into smaller tasks and thus cannot be parallelized.
Basic Parallel Computing: Task dependency graph
a = a + 2;
b = b + 3
for i = 0..3 do
c[i+1] += c[i];
b = b * a;
b = b + c[4]
[Figure: task dependency graph for the code above, built up step by step from nodes A, B, 2, 3, and C[0]…C[4].]
Basic Parallel Computing: Task dependency graph (2)
Synchronisation points
Critical path
Basic Parallel Computing: Critical path
The critical path determines the time required to execute a program. No matter how many processors are used, the time imposed by the critical path determines an upper limit to performance.
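A toy illustration (hypothetical, not NWChem code): computing the longest chain of dependent unit-cost tasks for the small example above shows why the c[i+1] += c[i] recurrence, not the processor count, sets the lower bound on runtime.

# Longest (critical) path through a small task dependency graph.
# Each task has unit cost; deps maps a task to the tasks it depends on.
# Hypothetical example mirroring the a/b/c code on the previous slide.
from functools import lru_cache

deps = {
    'c0': [], 'c1': ['c0'], 'c2': ['c1'], 'c3': ['c2'], 'c4': ['c3'],
    'a':  [],                      # a = a + 2
    'b1': [],                      # b = b + 3
    'b2': ['b1', 'a'],             # b = b * a
    'b3': ['b2', 'c4'],            # b = b + c[4]
}

@lru_cache(maxsize=None)
def path_length(task):
    return 1 + max([path_length(p) for p in deps[task]] + [0])

print('critical path length = %d tasks' % max(path_length(t) for t in deps))
# -> 6 (c0 -> c1 -> c2 -> c3 -> c4 -> b3); more processors cannot shorten it.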
The first step in developing a parallel algorithm is to parallelize along the critical path
What if that is not enough?
Case Study: Parallel Strategies for Plane-Wave Programs
Three parallelization schemes:
Distribute molecular orbitals (one eigenvector per task): minimal load balancing, but the overlaps <i|j> require communication
Distribute the Brillouin zone: the number of k-points is usually small
Distribute the basis (i.e. the FFT grid, slab decomposition): the FFT requires communication; this is the critical path parallelization
[Figure: Ne molecular orbitals over an Ng-point basis; the old parallel distribution (scheme 3 – critical path) assigns slabs of the Ngrid = N1xN2xN3 grid to proc = 1, 2, 3, … for all Norbitals orbitals.]
Case Study: Speeding up plane-wave DFT
[Figure: performance of UO2(2+) + 122 H2O; time (seconds) vs. number of CPUs, 1 to 10,000, on log-log axes. Each box in the distribution diagram represents a CPU.]
Case Study: Analyze the algorithm
Collect timings for important components of the existing algorithm
[Figures: timings (seconds, log scale) vs. number of CPUs (1 to 10,000) for the FFT tasks, non-local pseudopotential tasks, overlap task, and rotation tasks.]
For the bottlenecks:
Remove all-to-all and log(P) global operations beyond 500 CPUs
Overlap computation and communication (e.g. pipelining)
Duplicate the data?
Redistribute the data?
…
[Figure: old parallel distribution vs. new parallel distribution of the Ngrid = N1xN2xN3 by Norbitals coefficient array.]
Case Study: Try a new strategy
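A rough sketch of the data layout change (assumed numbers, numpy only; this is not how NWChem actually stores the wavefunction): the old scheme splits only the grid dimension across processors, while the new one also splits the orbital dimension over nj processor columns, shrinking each all-to-all and overlap operation.

# Sketch of splitting an Ngrid x Norbitals coefficient array over a 2D
# processor grid (illustrative numbers; not NWChem's internal layout).
import numpy as np

Ngrid, Norb = 96*96*96, 500          # matches the UO2(2+) + 122 H2O example
np_grid, nj = 64, 8                  # 64 processors along the grid, nj = 8 along orbitals

grid_chunks = np.array_split(np.arange(Ngrid), np_grid)
orb_chunks  = np.array_split(np.arange(Norb), nj)

# Processor (p, q) owns grid points grid_chunks[p] for orbitals orb_chunks[q];
# with nj = 1 this reduces to the old slab-only distribution.
print('block per processor: %d grid points x %d orbitals'
      % (len(grid_chunks[0]), len(orb_chunks[0])))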
Case Study: analyze new algorithm
[Figures: component timings (seconds, log scale) vs. number of CPUs (1 to 10,000) for the new distribution, with the orbital-direction processor count nj = 1, 2, 4, 8, 16, 32.]
Efficiency in one component can be traded for efficiency in another when their timings are of different orders.
Case Study: Success!
Python Example: AIMD Simulation

title "propane aimd simulation"
start propane2-db
memory 600 mb
permanent_dir ./perm
scratch_dir ./perm

geometry
 C   1.24480654  0.00000000 -0.25583795
 C   0.00000000  0.00000000  0.58345271
 H   1.27764005 -0.87801632 -0.90315343 mass 2.0
 H   2.15111436  0.00000000  0.34795707 mass 2.0
 H   1.27764005  0.87801632 -0.90315343 mass 2.0
 H   0.00000000 -0.87115849  1.24301935 mass 2.0
 H   0.00000000  0.87115849  1.24301935 mass 2.0
 C  -1.24480654  0.00000000 -0.25583795
 H  -2.15111436  0.00000000  0.34795707 mass 2.0
 H  -1.27764005 -0.87801632 -0.90315343 mass 2.0
 H  -1.27764005  0.87801632 -0.90315343 mass 2.0
end

basis
 * library 3-21G
end

python
 from nwmd import *
 surface = nwmd_run('geometry', 'dft', 10.0, 5000)
end

task python
Python Example: Header of nwmd.py

## import nwchem specific routines
from nwchem import *

## other libraries you might want to use
from math import *
from numpy import *
from numpy.linalg import *
from numpy.fft import *
import Gnuplot

####### basic rtdb routines to read and write coordinates #######
def geom_get_coords(name):
    #
    # This routine returns a list with the cartesian
    # coordinates in atomic units for the geometry
    # of given name
    #
    try:
        actualname = rtdb_get(name)
    except NWChemError:
        actualname = name
    coords = rtdb_get('geometry:' + actualname + ':coords')
    return coords

def geom_set_coords(name, coords):
    #
    # This routine, given a list with the cartesian
    # coordinates in atomic units, sets them in
    # the geometry of given name
    #
    try:
        actualname = rtdb_get(name)
    except NWChemError:
        actualname = name
    coords = list(coords)
    rtdb_put('geometry:' + actualname + ':coords', coords)
Python Example: basic part of the Verlet loop

# do steps-1 verlet steps
for s in range(steps-1):
    for i in range(nion3):
        rion0[i] = rion1[i]
    for i in range(nion3):
        rion1[i] = rion2[i]
    t += time_step

    ### set coordinates and calculate energy – nwchem specific ###
    geom_set_coords(geometry_name, rion1)
    (v, fion) = task_gradient(theory)

    ### verlet step ###
    for i in range(nion3):
        rion2[i] = 2.0*rion1[i] - rion0[i] - dti[i]*fion[i]

    ### calculate ke ###
    vion1 = []
    for i in range(nion3):
        vion1.append(h*(rion2[i] - rion0[i]))

    ke = 0.0
    for i in range(nion3):
        ke += 0.5*massi[i]*vion1[i]*vion1[i]
    e = v + ke
    print ' '
    print '@@ %5d %9.1f %19.10e %19.10e %14.5e' % (s+2, t, e, v, ke)