
Page 1

Evolution of Computational Approaches in NWChem

Eric J. Bylaska, W. De Jong, K. Kowalski, N. Govind, M. Valiev, and D. Wang

High Performance Software Development
Molecular Science Computing Facility

Page 2

Outline

Overview of the capabilities of NWChem

Existing terascale/petascale simulations in NWChem

Challenges with developing massively parallel algorithms (that work)

Using embeddable languages (Python) to develop multifaceted simulations

Summary

Page 3

Overview of NWChem: Background

NWChem is part of the Molecular Science Software Suite

Developed as part of the construction of EMSL

Environmental Molecular Sciences Laboratory, a DOE BER user facility, located at PNNL

Designed and developed to be a highly efficient and portable massively parallel computational chemistry package

Provides computational chemistry solutions that are scalable with respect to both chemical system size and MPP hardware size

Page 4

Overview of NWChem: Background

More than 2,000,000 lines of code (mostly FORTRAN); about half of it computer generated (Tensor Contraction Engine)

A new version is released on a yearly basis:
Addition of new modules and bug fixes
Ports of the software to new hardware
Increased performance (serial & parallel)

Freely available after signing a user agreement

World-wide distribution (downloaded by 1900+ groups); 70% academia, the rest government labs and industry

Page 5

Originally designed for parallel architectures; scales to 1000s of processors (parts even to 10,000s)

NWChem performs well on small compute clusters

Portable: runs on a wide range of computers, from supercomputers to a Mac or PC with Windows

Including Cray XT4 and IBM BlueGene; various operating systems, interconnects, and CPUs

Emphasis on modularity and portability

Overview of NWChem: Available on many compute platforms

Page 6

Overview of NWChem: capabilities

NWChem brings a full suite of methodologies to solve large scientific problems

High accuracy methods (MP2/CC/CI)
Gaussian based density functional theory
Plane wave based density functional theory
Molecular dynamics (including QMD)

Standard features found in most computational chemistry codes are not listed here

Brochure with detailed listing available: http://www.emsl.pnl.gov/docs/nwchem/nwchem.html

Page 7

Overview of NWChem: high accuracy methods

Coupled cluster
Closed shell coupled cluster [CCSD and CCSD(T)]
Tensor contraction engine (TCE): spin-orbital formalism with RHF, ROHF, UHF reference
  CCD, CCSDTQ, CCSD(T)/[T]/t/…, LCCD, LCCSD
  CR-CCSD(T), LR-CCSD(T)
  Multireference CC (available soon)
  EOM-CCSDTQ for excited states
  MBPT(2-4), CISDTQ, QCISD

CF3CHFO CCSD(T) with TCE: 606 bf, 57 electrons, open-shell

8-water cluster on EMSL computer: 1376 bf, 32 elec, 1840 processors; achieves 63% of 11 TFlops

Page 8

Overview of NWChem: Latest capabilities of high accuracy

Vertical excitation energies (eV), gas phase vs. DNA environment:
CR-EOMCCSD(T): 4.76 (gas phase), 5.24 (DNA environment); second state: 5.01 (gas phase), 5.79 (DNA environment)

Surface exciton energy: 6.35 +/- 0.10 eV (expt.); TDDFT (∞): 5.81 eV; EOMCCSD (∞): 6.38 eV

Dipole polarizabilities (in Å3) of the C60 molecule in the ZM3POL basis set (1080 basis set functions; 240 correlated electrons; 1.5 billion T2 amplitudes):

Wavelength (nm)   CC2      CCSD     Expt.
                  92.33    82.20    76.5±8
1064              94.77    83.62    79±4
532               88.62

Page 9

NWChem capabilities: DFT

Density functional theory
Wide range of local and non-local exchange-correlation functionals: Truhlar's M05 and M06 (Yan Zhao), Hartree-Fock exchange, meta-GGA functionals
Self-interaction correction (SIC) and OEP
Spin-orbit DFT: ECP, ZORA, DK
Constrained DFT, implemented by Qin Wu and Van Voorhis

DFT performance: Si75O148H66, 3554 bf, 2300 elec, Coulomb fitting

Page 10

Overview of NWChem: TDDFT

Density functional theory: TDDFT for excited states

Modeling the λmax (absorption maximum) of optical chromophores with dicyanovinyl and 3-phenylisoxazolone groups, using TDDFT, B3LYP, COSMO (ε = 8.93, CH2Cl2), 6-31G**; Andzelm (ARL) et al. (in preparation, 2008)

Expt: 539 nm; TDDFT: 572 nm
Expt: 696 nm; TDDFT: 730 nm

Development & validation of new Coulomb-attenuated (LC) functionals (Andzelm, Baer, Govind, in preparation 2008)

Page 11

Overview of NWChem: plane wave DFT

Plane wave density functional theory

Gamma-point pseudopotential and projector augmented wave
Band structure (with spin-orbit)
Extensive dynamics functionality with Car-Parrinello
CPMD/MM molecular dynamics, e.g. SPC/E, CLAYFF, solid state MD

Spin-Orbit splitting in GaAs

Silica/Water CPMD Simulation

CCl42-/water CPMD/MM Simulation

Various exchange-correlation functionals

LDA, PBE96, PBE0, (B3LYP)
Exact exchange
SIC and OEP

Can handle charged systems
A full range of pseudopotentials and a pseudopotential generator
A choice of state-of-the-art minimizers

Page 12

Overview of NWChem: classical molecular dynamics

Molecular dynamics
CHARMM and AMBER force fields
Various types of simulations:

Energy minimization
Molecular dynamics simulation, including ab initio dynamics
Free energy calculation
Multiconfiguration thermodynamic integration

Electron transfer through proton hopping (Q-HOP), i.e. semi-QM in classical MD

Implemented by the Volkhard Helms group, Saarland University, Germany

Set up and analyze runs with Ecce

Page 13


Overview of NWChem: QM/MM

Seamless integration of molecular dynamics with coupled cluster and DFT

Optimization and transition state search
QM/MM potential of mean force (PMF)
Modeling properties at finite temperature

Excited states with EOM-CC
Polarizabilities with linear response CC
NMR chemical shift with DFT

QM/MM CR-EOM-CCSD excited state calculations of the cytosine base in DNA, Valiev et al., JCP 125 (2006)


Page 14

Overview of NWChem: Miscellaneous functionality

Other functionality available in NWChem:
Electron transfer
Vibrational SCF and DFT for anharmonicity
COSMO
ONIOM
Relativity through spin-orbit ECP, ZORA, and DK
NMR shielding and indirect spin-spin coupling

Interface with VENUS for chemical reaction dynamics
Interfaces with POLYRATE and Python

Page 15

Existing Terascale/Petascale capabilities

NWChem has several modules that scale to the terascale, and work is ongoing to approach the petascale

Figures (speedup vs. number of nodes, Cray T3E/900):
CCSD scalability: C60, 1080 basis set functions
MD scalability: octanol (216,000 atoms)
NWPW scalability: UO2(2+) + 122 H2O (Ne=500, Ng=96x96x96)

Page 16

Existing terascale/petascale capabilities: Embarrassingly Parallel Algorithms

Some types of problems can be decomposed and executed in parallel with virtually no need for tasks to share data. Such problems are called embarrassingly parallel because they are so straightforward: little or no inter-task communication is required.

Embeddable languages such as Python can be used to implement them

Examples:
Computing Potential Energy Surfaces
Potential of Mean Force (PMF)
Monte Carlo
QM/MM Free Energy Perturbation
Dynamical Nucleation Theory
ONIOM
...
(A minimal Python sketch of the first example follows.)
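Sketch (illustrative only): a potential energy surface scan written in NWChem's embedded Python. The helpers get_coords/set_coords below mirror the rtdb routines from nwmd.py shown later in this talk; the geometry name, atom indices, bond distances, and the choice of 'dft' theory are assumed values for the example.

## illustrative PES scan: each point is an independent energy evaluation
from nwchem import *

def get_coords(name):
    # read cartesian coordinates (atomic units) from the run-time database
    try:
        actualname = rtdb_get(name)
    except NWChemError:
        actualname = name
    return rtdb_get('geometry:' + actualname + ':coords')

def set_coords(name, coords):
    # write cartesian coordinates (atomic units) back to the run-time database
    try:
        actualname = rtdb_get(name)
    except NWChemError:
        actualname = name
    rtdb_put('geometry:' + actualname + ':coords', list(coords))

energies = []
for i in range(11):
    d = 2.4 + 0.1*i                      # assumed C-C distances in bohr
    coords = get_coords('geometry')
    coords[3] = coords[0] + d            # place atom 2 at distance d along x from atom 1
    coords[4] = coords[1]
    coords[5] = coords[2]
    set_coords('geometry', coords)
    (v, fion) = task_gradient('dft')     # energy and gradient at this point
    energies.append((d, v))
    print '@PES %8.3f %19.10e' % (d, v)

Because each point depends only on its own geometry, the same loop could be split across independent NWChem runs or processor groups with no inter-task communication.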

Page 17

Challenges: Tackling Amdahl's Law

Amdahl's law: if a fraction f of a program can be parallelized and the remaining fraction 1 - f must run serially, then on N processors

  T_p(N) = T_s * [ (1 - f) + f / N ]

and the speedup is

  S(N) = T_s / T_p(N) = 1 / [ (1 - f) + f / N ]  <=  1 / (1 - f)

Figure: a program split into a parallel fraction f and a serial fraction 1 - f; parallel efficiency = S(N) / N.

For a job that takes 24 hours on one processor, the serial fraction alone costs:

  Parallel fraction f   Processors N   Time in the serial fraction
  0.99                  100            24 h x 1%    = 14 min
  0.999                 1,000          24 h x 0.1%  = 1.4 min
  0.9999                10,000         24 h x 0.01% = 8.6 s

There is a limit to the performance gain we can expect from parallelization

Significant time investment for each 10-fold increase in parallelism!
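To make the numbers concrete, here is a small Python sketch that evaluates the Amdahl expression above for a job that takes 24 hours on one processor (the f and N values follow the table; everything else is illustrative):

# evaluate Amdahl's law for a job that takes 24 hours on one processor
def amdahl(f, N, t_serial_hours=24.0):
    t_par = t_serial_hours * ((1.0 - f) + f / N)   # wall-clock time on N processors
    speedup = t_serial_hours / t_par
    return t_par, speedup, speedup / N             # time, speedup, parallel efficiency

for (f, N) in [(0.99, 100), (0.999, 1000), (0.9999, 10000)]:
    t_par, s, eff = amdahl(f, N)
    serial_part = 24.0 * (1.0 - f)                 # hours spent in the serial fraction alone
    print 'f=%7.4f N=%6d  time=%7.3f h  speedup=%7.1f  eff=%4.2f  serial part=%6.3f h' \
          % (f, N, t_par, s, eff, serial_part)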

Page 18

Challenges: Gustafson’s Law – make the problem big

Gustafson's law: scale the problem with the machine. For problem size n, let a(n) be the time spent in the serial part and b(n) the time spent in the parallel part when the job is run on P processors, so

  T_p(n) = a(n) + b(n)

while the same scaled problem would take

  T_s(n) = a(n) + P * b(n)

on one processor. The scaled speedup is

  S = T_s / T_p = [ a(n) + P * b(n) ] / [ a(n) + b(n) ]

Normalizing a(n) + b(n) = 1 gives S = a(n) + P * (1 - a(n)). Assuming a(n) becomes small as n grows (and a(n) + b(n) = 1), S approaches P.

Gustafson's Law (also known as Gustafson-Barsis' law) states that any sufficiently large problem can be efficiently parallelized
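A corresponding sketch for the scaled speedup, assuming the serial work a(n) is 1% of the normalized scaled problem (an illustrative value):

# Gustafson scaled speedup with normalized a(n) + b(n) = 1
def scaled_speedup(a, P):
    return a + P * (1.0 - a)

for P in [100, 1000, 10000]:
    print 'P=%6d  scaled speedup=%9.1f' % (P, scaled_speedup(0.01, P))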

Page 19

Challenges: Extending Time – Failure of Gustafson’s Law?

The need for up-scaling in time is especially critical for classical molecular dynamics simulations and ab initio molecular dynamics, where an up-scaling of about 1000 times will be needed.

Ab initio molecular dynamics simulations run over several months currently cover only 10 to 100 picoseconds, while the processes of interest are on the order of nanoseconds.

The step length in an ab initio molecular dynamics simulation is on the order of 0.1–0.2 fs/step

1 ns of simulation time = 10,000,000 steps
At 1 second per step: 115 days of computing time
At 10 seconds per step: 3 years
At 30 seconds per step: 9 years

Classical molecular dynamics simulations run over several months are only able to simulate between 10 and 100 nanoseconds, but many of the physical processes of interest are on the order of at least a microsecond.

The step length in a classical molecular dynamics simulation is on the order of 1–2 fs/step

1 μs of simulation time = 1,000,000,000 steps
At 0.01 seconds per step: 115 days of computing time
At 0.1 seconds per step: 3 years
At 1 second per step: 30 years
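The estimates above follow from simple arithmetic; a short sketch with the same step lengths and costs per step:

# wall-clock estimate for a fixed amount of simulation time
def wall_clock(sim_time_fs, timestep_fs, seconds_per_step):
    steps = sim_time_fs / timestep_fs
    days = steps * seconds_per_step / (3600.0 * 24.0)
    return steps, days

print wall_clock(1.0e6, 0.1, 1.0)    # AIMD: 1 ns at 0.1 fs/step, 1 s/step -> 1e7 steps, ~116 days
print wall_clock(1.0e6, 0.1, 10.0)   # 10 s/step  -> ~1160 days (~3.2 years)
print wall_clock(1.0e9, 1.0, 0.01)   # classical MD: 1 us at 1 fs/step, 0.01 s/step -> ~116 days
print wall_clock(1.0e9, 1.0, 1.0)    # 1 s/step   -> ~11,600 days (~32 years)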

Page 20

Challenges: Need For Communication Tolerant Algorithms

Data motion costs increasing due to “processor memory gap”

Coping strategies

Trade computation for communication [SC03]

Overlap Computation and Communication
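One way to picture the overlap strategy, as a generic sketch rather than NWChem's actual communication code, is a non-blocking boundary exchange in which useful work is done while the message is in flight (assumes mpi4py and NumPy; the data layout is made up for illustration):

# post communication first, do independent work, then complete the dependent part
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

local = np.random.rand(100000)                      # this rank's slice of data
halo = np.empty(1, dtype='d')                       # value expected from the left neighbor

reqs = [comm.Isend(local[-1:], dest=right, tag=0),  # non-blocking send of the boundary value
        comm.Irecv(halo, source=left, tag=0)]       # non-blocking receive of the neighbor's boundary

partial = np.sum(local[1:] * local[:-1])            # work that does not need the halo value

MPI.Request.Waitall(reqs)                           # communication completes behind the computation
total = partial + halo[0] * local[0]                # finish the part that needed the received value
print rank, total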

Figures:
Speedup for Car-Parrinello code vs. number of nodes (0 to 70), Cray T3E and IBM-SP
Processor-memory performance gap, 1980-2005: processor performance grows ~55%/year while DRAM memory performance grows ~7%/year
Scaling of a local basis DFT calculation: time (s) vs. number of processors (8 to 128)

Page 21

Basic Parallel Computing: tasks

Break program into tasks

A task is a coherent piece of work (such as reading data from the keyboard, or some part of a computation) that is to be done sequentially; it cannot be further divided into other tasks and thus cannot be parallelized.

Page 22

Basic Parallel Computing: Task dependency graph

a = a + 2;
b = b + 3;
for i = 0..3 do
    c[i+1] += c[i];
b = b * a;
b = b + c[4];

Figure: the task dependency graph for the code above, built up step by step over the nodes A, B, 2, 3, and C[0] through C[4].

Page 23

Basic Parallel Computing: Task dependency graph (2)

Synchronisation points

Critical path

Page 24

Basic Parallel Computing: Critical path

The critical path determines the time required to execute a program. No matter how many processors are used, the time imposed by the critical path determines an upper limit to performance.
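As an illustration (the task graph and timings are invented, not taken from NWChem), the length of the critical path is the largest finish time over all tasks, where a task's finish time is its own cost plus the latest finish time of its prerequisites:

# critical path = longest cost-weighted path through the task dependency graph
deps = {'read': [], 'fA': ['read'], 'fB': ['read'],
        'combine': ['fA', 'fB'], 'write': ['combine']}      # task -> prerequisite tasks
cost = {'read': 1.0, 'fA': 5.0, 'fB': 2.0, 'combine': 1.0, 'write': 1.0}

finish = {}
def finish_time(task):
    # earliest possible finish = own cost + latest prerequisite finish
    if task not in finish:
        finish[task] = cost[task] + max([finish_time(d) for d in deps[task]] + [0.0])
    return finish[task]

print 'critical path length:', max(finish_time(t) for t in deps)   # 8.0 (read -> fA -> combine -> write)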

The first step in developing a parallel algorithm is to parallelize along the critical path

If not enough?

Page 25

Case Study: Parallel Strategies for Plane-Wave Programs

Ne molecular orbitals, each expanded in Ng basis (FFT grid) coefficients.

Three parallelization schemes:
1. Distribute the molecular orbitals (one eigenvector per task): <i|j> requires communication
2. Distribute the Brillouin zone: the number of k-points is usually small
3. Distribute the basis (i.e. the FFT grid): slab decomposition; the FFT requires communication; minimal load balancing

Critical path parallelization (scheme 3).

Page 26

Case Study: Speeding up plane-wave DFT

Old parallel distribution (scheme 3 – critical path): the Ngrid = N1 x N2 x N3 FFT grid is distributed over the CPUs for all Norbitals orbitals (each box in the figure represents a CPU).

Figure: performance of UO2(2+) + 122 H2O, time (seconds) vs. number of CPUs (log-log, 1 to 10,000 CPUs).

Page 27

Case Study: Analyze the algorithm

Collect timings for important components of the existing algorithm

Figures: timings (seconds, log scale) vs. number of CPUs (1 to 10,000) for the FFT tasks, non-local PSP tasks, overlap tasks, and rotation tasks of the existing algorithm.

For bottlenecks:
Remove all-to-all, log(P) global operations beyond 500 CPUs
Overlap computation and communication (e.g. pipelining)
Duplicate the data? Redistribute the data?
...

Page 28

Case Study: Try a new strategy

Old parallel distribution: only the Ngrid = N1 x N2 x N3 grid is distributed across CPUs. New parallel distribution: both the Ngrid = N1 x N2 x N3 grid and the Norbitals orbitals are distributed, over a two-dimensional grid of CPUs.
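A hypothetical sketch of the idea behind the new distribution (the function names, the block mapping, and the 1024-CPU count are illustrative, not NWChem's actual code): the Ngrid x Norbitals coefficient array is split over a two-dimensional processor grid, with nj processor columns along the orbital dimension; nj = 1 recovers the old, grid-only distribution.

# block sizes for splitting n items over p processors as evenly as possible
def block_sizes(n, p):
    return [n // p + (1 if i < n % p else 0) for i in range(p)]

def local_block(ngrid, norbitals, nproc, nj):
    np_grid = nproc // nj                       # processors along the grid dimension
    rows = block_sizes(ngrid, np_grid)          # grid points per processor row
    cols = block_sizes(norbitals, nj)           # orbitals per processor column
    return rows[0], cols[0]                     # shape of the largest local block

# 96x96x96 grid, 500 orbitals (as in the UO2(2+) + 122 H2O benchmark), 1024 CPUs assumed
print local_block(96*96*96, 500, 1024, 1)       # old: (864, 500) - grid split only
print local_block(96*96*96, 500, 1024, 8)       # new: (6912, 63) - grid and orbitals both split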

Page 29

Case Study: Analyze the new algorithm

Figures: timings (seconds, log scale) vs. number of CPUs (1 to 10,000) for the new algorithm, for nj = 1, 2, 4, 8, 16, and 32.

Efficiency in one component can be traded for efficiency in another component if their timings are of different orders of magnitude

Page 30: Evolution of Computational Approaches in NWChem Eric J. Bylaska, W. De Jong, K. Kowalski, N. Govind, M. Valiev, and D. Wang High Performance Software Development

Case Study: Success!

Page 31

Python Example: AIMD Simulation

title "propane aimd simulation"
start propane2-db
memory 600 mb
permanent_dir ./perm
scratch_dir ./perm

geometry
 C  1.24480654  0.00000000 -0.25583795
 C  0.00000000  0.00000000  0.58345271
 H  1.27764005 -0.87801632 -0.90315343 mass 2.0
 H  2.15111436  0.00000000  0.34795707 mass 2.0
 H  1.27764005  0.87801632 -0.90315343 mass 2.0
 H  0.00000000 -0.87115849  1.24301935 mass 2.0
 H  0.00000000  0.87115849  1.24301935 mass 2.0
 C -1.24480654  0.00000000 -0.25583795
 H -2.15111436  0.00000000  0.34795707 mass 2.0
 H -1.27764005 -0.87801632 -0.90315343 mass 2.0
 H -1.27764005  0.87801632 -0.90315343 mass 2.0
end

basis
 * library 3-21G
end

python
  from nwmd import *
  surface = nwmd_run('geometry','dft',10.0, 5000)
end

task python

Page 32

Python Example: Header of nwmd.py

## import nwchem specific routines
from nwchem import *

## other libraries you might want to use
from math import *
from numpy import *
from numpy.linalg import *
from numpy.fft import *
import Gnuplot

####### basic rtdb routines to read and write coordinates ###########################
def geom_get_coords(name):
    #
    # This routine returns a list with the cartesian
    # coordinates in atomic units for the geometry
    # of given name
    #
    try:
        actualname = rtdb_get(name)
    except NWChemError:
        actualname = name
    coords = rtdb_get('geometry:' + actualname + ':coords')
    return coords

def geom_set_coords(name,coords):
    #
    # This routine, given a list with the cartesian
    # coordinates in atomic units, sets them in
    # the geometry of given name.
    #
    try:
        actualname = rtdb_get(name)
    except NWChemError:
        actualname = name
    coords = list(coords)
    rtdb_put('geometry:' + actualname + ':coords',coords)

Page 33

Python Example: basic part of Verlet loop

    # do steps-1 verlet steps
    for s in range(steps-1):
        for i in range(nion3):
            rion0[i] = rion1[i]
        for i in range(nion3):
            rion1[i] = rion2[i]
        t += time_step

        ### set coordinates and calculate energy - nwchem specific ####
        geom_set_coords(geometry_name,rion1)
        (v,fion) = task_gradient(theory)

        ### verlet step ###
        for i in range(nion3):
            rion2[i] = 2.0*rion1[i] - rion0[i] - dti[i]*fion[i]

        ### calculate ke ###
        vion1 = []
        for i in range(nion3):
            vion1.append(h*(rion2[i] - rion0[i]))
        ke = 0.0
        for i in range(nion3):
            ke += 0.5*massi[i]*vion1[i]*vion1[i]
        e = v + ke
        print ' '
        print '@@ %5d %9.1f %19.10e %19.10e %14.5e' % (s+2,t,e,v,ke)
ke = 0.0 for i in range(nion3): ke += 0.5*massi[i]*vion1[i]*vion1[i] e = v + ke print ' ' print '@@ %5d %9.1f %19.10e %19.10e %14.5e' % (s+2,t,e,v,ke)