Tensor contraction engine & extensible many-electron theory module in NWChem

So Hirata
Pacific Northwest National Laboratory
MSS group meeting (24 Oct, 2002)


Collaborators & Sponsors

• M. Nooijen (Princeton University)
• R. J. Harrison & D. Bernholdt (Oak Ridge National Laboratory)
• D. Cociorva, G. Baumgartner, R. Pitzer, & P. Sadayappan (Ohio State University)
• J. Ramanujam (Louisiana State University)
• Office of Basic Energy Sciences, Department of Energy
• Office of Biological and Environmental Research, Department of Energy

Purpose of this project

• Create a high-level symbolic manipulation language that derives the working equations of second-quantized many-electron theories and implements them automatically
  • Expedites complex and error-prone many-electron theory implementation
  • Helps develop and examine new theories or approximations
  • Facilitates parallelization and other laborious code optimizations
  • CCSDT T3 code is >18000 lines in Fortran77!

Operator contraction engine (OCE)

• Object-oriented symbolic manipulation program that derives working equations from any well-defined second-quantized many-electron theory ansatz

• Performs valid contractions of normal-ordered operators according to Wick’s theorem and reduces any given ansatz into the simplest form of tensor contraction expressions

• Consolidates identical terms and recognizes terms that are related by permutation symmetry

Tensor contraction engine (TCE)

• Object-oriented symbolic manipulation program that analyzes tensor contraction expressions and implements them as efficient programs

• Breaks down multiple tensor contractions (A=BCDE) into a sequence of elementary tensor contractions (X=DE; Y=BX; A=YC) with minimal operation costs

• Factorizes common contractions [X=BC+BD into X=B(C+D)]

• Generates debug-level Fortran90 programs and release-level parallel Fortran77 programs
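The breakdown of a multiple contraction such as A=BCDE into binary steps can be illustrated with the textbook matrix-chain ordering problem. The sketch below is only an analogy under simplifying assumptions: every tensor is treated as a plain matrix, cost is the number of scalar multiplications, and `best_order` is a hypothetical helper, not the search that TCE itself performs.

```python
from functools import lru_cache


def best_order(dims):
    """Return (cost, parenthesization) for multiplying a chain of matrices.

    dims[i] and dims[i + 1] are the row and column sizes of the i-th matrix,
    so a chain of n matrices needs n + 1 entries. Cost counts scalar
    multiplications only.
    """
    n = len(dims) - 1
    names = [chr(ord("B") + i) for i in range(n)]  # B, C, D, E, ...

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Cheapest way to multiply matrices i..j (inclusive).
        if i == j:
            return 0, names[i]
        best = None
        for k in range(i, j):
            left_cost, left = solve(i, k)
            right_cost, right = solve(k + 1, j)
            cost = left_cost + right_cost + dims[i] * dims[k + 1] * dims[j + 1]
            if best is None or cost < best[0]:
                best = (cost, f"({left}{right})")
        return best

    return solve(0, n - 1)


# Toy sizes: B is 10x100, C is 100x5, D is 5x50, E is 50x2.
cost, order = best_order([10, 100, 5, 50, 2])
print(cost, order)
```

For these toy sizes the cheapest sequence starts with X=DE, which mirrors the X=DE; Y=BX; A=YC style of breakdown quoted above. Factorization works in the same spirit: X=B(C+D) needs one multiplication where X=BC+BD needs two.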

OCE & TCE demonstration

What is new?

• Full exploitation of index permutation symmetry
  • By also taking advantage of spin and spatial symmetry, the programs generated by TCE are theoretically operation-cost minimal

• OCE extracts permutation symmetries among working equations

• TCE breaks down permutation operators into elementary permutation operators, analyzes which permutation symmetries can be exploited, and reflects the result in the generated code

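Permutation symmetry

• Primitive tensors that appear in many-electron theories possess “permutation anti-symmetry.” For example, a two-electron integral tensor and a three-electron excitation amplitude tensor have the following properties:

$$ v^{pq}_{rs} = -v^{pq}_{sr} = -v^{qp}_{rs} = v^{qp}_{sr} $$

$$
\begin{aligned}
t^{abc}_{ijk} &= -t^{abc}_{ikj} = -t^{abc}_{jik} = t^{abc}_{jki} = t^{abc}_{kij} = -t^{abc}_{kji} \\
&= -t^{acb}_{ijk} = t^{acb}_{ikj} = t^{acb}_{jik} = -t^{acb}_{jki} = -t^{acb}_{kij} = t^{acb}_{kji} \\
&= -t^{bac}_{ijk} = t^{bac}_{ikj} = t^{bac}_{jik} = -t^{bac}_{jki} = -t^{bac}_{kij} = t^{bac}_{kji} \\
&= t^{bca}_{ijk} = -t^{bca}_{ikj} = -t^{bca}_{jik} = t^{bca}_{jki} = t^{bca}_{kij} = -t^{bca}_{kji} \\
&= t^{cab}_{ijk} = -t^{cab}_{ikj} = -t^{cab}_{jik} = t^{cab}_{jki} = t^{cab}_{kij} = -t^{cab}_{kji} \\
&= -t^{cba}_{ijk} = t^{cba}_{ikj} = t^{cba}_{jik} = -t^{cba}_{jki} = -t^{cba}_{kij} = t^{cba}_{kji}
\end{aligned}
$$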
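The sign pattern of the 36 equivalent triples amplitudes follows from the parities of the superscript and subscript permutations. A minimal sketch, assuming index labels are given as strings; `parity` and `equivalent_elements` are illustrative names, not part of OCE or TCE:

```python
from itertools import permutations


def parity(perm):
    """Sign (+1 or -1) of a permutation given as a tuple of positions."""
    sign = 1
    seen = [False] * len(perm)
    for start in range(len(perm)):
        if seen[start]:
            continue
        length = 0
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        if length % 2 == 0:  # a cycle of even length is an odd permutation
            sign = -sign
    return sign


def equivalent_elements(upper, lower):
    """Map every reordering of (upper, lower) labels to its relative sign."""
    result = {}
    for pu in permutations(range(len(upper))):
        for pl in permutations(range(len(lower))):
            key = ("".join(upper[i] for i in pu),
                   "".join(lower[i] for i in pl))
            result[key] = parity(pu) * parity(pl)
    return result


signs = equivalent_elements("abc", "ijk")
print(len(signs), signs[("acb", "ijk")], signs[("acb", "ikj")])
```

The same function applied to `("pq", "rs")` gives the four related two-electron integrals.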

Implication

• Reduced storage size
  • Instead of storing the full $t^{ab}_{ij}$, we may keep only $t^{a<b}_{i<j}$

• Reduced operation cost by shorter summation index ranges

$$ \sum_{c,d} t^{cd}_{ij}\, v^{ab}_{cd} = 2 \sum_{c<d} t^{cd}_{ij}\, v^{ab}_{cd} $$

• Reduced operation cost by shorter target index ranges
  • Instead of computing the full $2 \sum_{c<d} t^{cd}_{ij}\, v^{ab}_{cd}$, we may obtain just $2 \sum_{c<d} t^{cd}_{i<j}\, v^{a<b}_{cd}$
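The factor of 2 gained by restricting the summation to c<d can be checked numerically. The sketch below is a toy check under stated assumptions: all indices share one small range `n`, the tensors are random and antisymmetric only within their index pairs, and the nested-list layout `t[c][d][i][j]`, `v[a][b][c][d]` is chosen for illustration.

```python
import random


def antisymmetric_pair_tensor(n, rng):
    """Random x[p][q][r][s], antisymmetric in (p, q) and in (r, s)."""
    x = [[[[0.0] * n for _ in range(n)] for _ in range(n)] for _ in range(n)]
    for p in range(n):
        for q in range(p + 1, n):
            for r in range(n):
                for s in range(r + 1, n):
                    val = rng.uniform(-1.0, 1.0)
                    x[p][q][r][s] = val
                    x[q][p][r][s] = -val
                    x[p][q][s][r] = -val
                    x[q][p][s][r] = val
    return x


n = 4
rng = random.Random(0)
t = antisymmetric_pair_tensor(n, rng)  # t[c][d][i][j]
v = antisymmetric_pair_tensor(n, rng)  # v[a][b][c][d]


def full_sum(a, b, i, j):
    """Sum over all c, d."""
    return sum(t[c][d][i][j] * v[a][b][c][d]
               for c in range(n) for d in range(n))


def restricted_sum(a, b, i, j):
    """Twice the sum over c < d only."""
    return 2.0 * sum(t[c][d][i][j] * v[a][b][c][d]
                     for c in range(n) for d in range(c + 1, n))


stored_pairs = n * (n - 1) // 2  # number of ordered pairs a < b
print(full_sum(0, 1, 2, 3), restricted_sum(0, 1, 2, 3), stored_pairs)
```

Because the result is itself antisymmetric in (a, b) and in (i, j), only the `stored_pairs` × `stored_pairs` targets with a<b and i<j need to be computed and kept.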

Challenges

• What is the index permutation symmetry of an intermediate tensor?
  • Consider the intermediate $i^{ab}_{ij} = t^{a}_{i}\, t^{b}_{j}$

• What is the best way to store just the non-redundant elements of tensors?

• What is the operation cost minimal contraction of two tensors with permutation symmetry?

• How can TCE generate a code that exploits spin, spatial, and permutation symmetries at the same time?
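The product of two singles amplitudes shows why the first question is not trivial. A small numerical sketch, assuming random positive toy amplitudes stored as `t1[a][i]` (the sizes and names are illustrative only): the intermediate is unchanged when both index pairs are swapped together, but it is not antisymmetric under a swap of a and b alone.

```python
import random

rng = random.Random(1)
n_particle, n_hole = 3, 2
# Toy singles amplitudes t1[a][i]; strictly positive so no product vanishes.
t1 = [[rng.uniform(0.1, 1.0) for _ in range(n_hole)]
      for _ in range(n_particle)]


def inter(a, b, i, j):
    """Intermediate i^{ab}_{ij} = t^a_i * t^b_j."""
    return t1[a][i] * t1[b][j]


def holds_everywhere(relation):
    """True if relation(a, b, i, j) holds for every index combination."""
    return all(relation(a, b, i, j)
               for a in range(n_particle) for b in range(n_particle)
               for i in range(n_hole) for j in range(n_hole))


antisym_ab = holds_everywhere(
    lambda a, b, i, j: abs(inter(a, b, i, j) + inter(b, a, i, j)) < 1e-12)
sym_pair_swap = holds_everywhere(
    lambda a, b, i, j: abs(inter(a, b, i, j) - inter(b, a, j, i)) < 1e-12)
print(antisym_ab, sym_pair_swap)
```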

Index permutation symmetry versus permutation symmetry of tensor contraction expressions

• Index permutation anti-symmetry ultimately reflects the Pauli principle of fermions; any tensor having electron indices (such as integrals or excitation amplitudes) is anti-symmetric
  • When there is a multiple tensor contraction such as

$$ i^{abc}_{ijk} = \sum_{d,m,n} t^{abd}_{jkn}\, t^{c}_{m}\, v^{mn}_{id}, $$

there “must” also be the eight contractions obtained by applying the permutation operators

$$ P(ca),\quad P(cb),\quad P(ji),\quad P(ki),\quad P(ca)P(ji),\quad P(cb)P(ji),\quad P(ca)P(ki),\quad P(cb)P(ki) $$

to $\sum_{d,m,n} t^{abd}_{jkn}\, t^{c}_{m}\, v^{mn}_{id}$, each transposition bringing a sign change.

Breakdown of permutation operators

• When breaking down a multiple tensor contraction into a sequence of binary tensor contractions, we should break down the permutation operators appropriately, so that each intermediate has maximum index permutation symmetries

$$ r^{abc}_{ijk} = P(ab/c)\, P(i/jk) \sum_{d,m,n} t^{abd}_{jkn}\, t^{c}_{m}\, v^{mn}_{id} $$

$$ r^{abc}_{ijk} = P(ab/c) \sum_{m} t^{c}_{m}\, i^{abm}_{ijk}, \qquad i^{abm}_{ijk} = P(i/jk) \sum_{d,n} t^{abd}_{jkn}\, v^{mn}_{id} $$
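The expansion of compound permutation operators into elementary ones can be sketched as follows. The convention used here is an assumption, not taken from the slides: an operator of the P(xy/z) type is read as the identity plus the two transpositions that exchange the single index with a member of the pair, each with sign -1. The helper names are hypothetical.

```python
def expand_partition(pair, single):
    """Elementary terms of a P(xy/z)-type operator (assumed convention).

    Returns a list of (sign, relabeling) where relabeling maps each
    original index label to the label that replaces it.
    """
    x, y = pair
    z = single
    identity = {x: x, y: y, z: z}
    swap_xz = {x: z, y: y, z: x}
    swap_yz = {x: x, y: z, z: y}
    return [(+1, identity), (-1, swap_xz), (-1, swap_yz)]


def product(terms_1, terms_2):
    """Combine two operators that act on disjoint index sets."""
    return [(s1 * s2, {**m1, **m2})
            for s1, m1 in terms_1 for s2, m2 in terms_2]


# P(ab/c) acting on superscripts times P(i/jk) acting on subscripts.
terms = product(expand_partition("ab", "c"), expand_partition("jk", "i"))
for sign, relabel in terms:
    upper = "".join(relabel[s] for s in "abc")
    lower = "".join(relabel[s] for s in "ijk")
    print(f"{sign:+d}  upper={upper}  lower={lower}")
```

Under this convention the product yields nine terms: the original contraction, four singly permuted terms with a minus sign, and four doubly permuted terms with a plus sign.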

What is the best way to store an intermediate?

• An intermediate tensor has much more limited index permutation symmetries. Superscript (subscript) indices are categorized into global targets and local targets, and permutation anti-symmetry exists among just global targets and among just local targets. So in general, the non-redundant elements are those in which the global target indices are in ascending order and the local target indices are in ascending order, separately for the superscripts and for the subscripts.

[Index expression garbled in extraction.]

What is the general form of tensor contraction with permutation symmetry?

• Expansion

[Equations garbled in extraction: two expressions relating tensors over ordered index groups g_1, … and c_1, ….]

Note that an excitation amplitude tensor will not have local target indices. This is because two excitation amplitudes cannot contract (both have particle superscripts and hole subscripts).

What is the general form of tensor contraction with permutation symmetry?

• Contraction

[Equation garbled in extraction: a sum over ordered index groups c_1, … with two factorial prefactors.]

Note that at least one of the two tensors is always an excitation amplitude tensor.

What is the general form of tensor contraction with permutation symmetry?

• Compression

[Equation garbled in extraction: a permutation operator P applied to an intermediate over ordered index groups g_1, ….]

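Spin & spatial symmetry

• Spin symmetry

$$ \sum_{p\,\in\,\text{superscript indices}} s_p \;=\; \sum_{q\,\in\,\text{subscript indices}} s_q $$

• Spatial symmetry

$$ \Gamma_p \otimes \Gamma_q \otimes \cdots \otimes \Gamma_z = \text{Totally symmetric} $$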
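Both rules act as cheap screens that decide whether a tensor block can be non-zero before any arithmetic is done. A minimal sketch under two assumptions that are not stated on the slide: spins are encoded as +1 (alpha) and -1 (beta), and the point group is Abelian (D2h or a subgroup) with irreps numbered so that the direct product is a bitwise XOR and 0 is the totally symmetric irrep.

```python
def spin_allowed(upper_spins, lower_spins):
    """Spin rule: superscript and subscript spins must sum to the same value.

    Spins are +1 for alpha and -1 for beta (units of 1/2).
    """
    return sum(upper_spins) == sum(lower_spins)


def spatial_allowed(irreps):
    """Spatial rule for Abelian groups with XOR-multiplying irrep labels.

    The block can be non-zero only if the product of all index irreps is
    the totally symmetric irrep, labelled 0 here.
    """
    product = 0
    for g in irreps:
        product ^= g
    return product == 0


def block_is_nonzero(upper_spins, lower_spins, irreps):
    """Combine both screens for one tensor block."""
    return spin_allowed(upper_spins, lower_spins) and spatial_allowed(irreps)


print(block_is_nonzero([1, -1], [1, -1], [1, 1, 2, 2]))
print(block_is_nonzero([1, 1], [1, -1], [1, 1, 2, 2]))
```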

An example

$$ x^{lbc}_{ijk} = P(i/jk)\, P(b/c) \sum_{d} t^{bd}_{jk}\, v^{lc}_{id} $$

    LOOP OVER b,j<=k BLOCKS
      LOOP OVER l,c,i BLOCKS
        LOOP OVER d BLOCKS
          IF (b<=d) READ t(b<=d,j<=k)
          IF (d<b) READ t(d<b,j<=k)
          READ v(l<c,i<d)          ! Always holes < particles
          IF (spin/spatial sym block of t is non-zero) THEN
            IF (spin/spatial sym block of v is non-zero) THEN
              MAKE x(l,b,c,i,j<=k) BLOCK BY DGEMM
              IF (b<=c and i<=j) ACCUM x(l,b<=c,i<=j<=k)
              IF (b<=c and j<=i,i<=k) ACCUM -x(l,b<=c,j<=i<=k)
              IF (b<=c and k<=i) ACCUM x(l,b<=c,j<=k<=i)
              IF (c<=b and i<=j) ACCUM -x(l,c<=b,i<=j<=k)
              IF (c<=b and j<=i,i<=k) ACCUM x(l,c<=b,j<=i<=k)
              IF (c<=b and k<=i) ACCUM -x(l,c<=b,j<=k<=i)
            END IF                 ! Note that b=c, i=j block is accumulated
          END IF                   ! multiple times
        END LOOP
      END LOOP
    END LOOP
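The ACCUM branches in the example all perform one task: bring the freshly computed indices into canonical ascending order and apply the sign of the reordering. The sketch below shows that bookkeeping for single elements rather than tiles, which is a simplification; the dictionary storage and the names `canonical` and `accumulate` are illustrative assumptions, not the generated Fortran.

```python
def canonical(indices):
    """Sort an antisymmetric index group and return (sorted tuple, sign).

    The sign is 0 when an index repeats, because such an element vanishes.
    """
    idx = list(indices)
    sign = 1
    # Bubble sort: every swap of neighbouring indices flips the sign.
    for end in range(len(idx) - 1, 0, -1):
        for k in range(end):
            if idx[k] > idx[k + 1]:
                idx[k], idx[k + 1] = idx[k + 1], idx[k]
                sign = -sign
    if len(set(idx)) != len(idx):
        sign = 0
    return tuple(idx), sign


def accumulate(store, l, b, c, i, j, k, value):
    """Add x(l,b,c,i,j,k) into storage that keeps only b<c and i<j<k."""
    particles, s_upper = canonical((b, c))
    holes, s_lower = canonical((i, j, k))
    sign = s_upper * s_lower
    if sign != 0:
        key = (l,) + particles + holes
        store[key] = store.get(key, 0.0) + sign * value


store = {}
accumulate(store, 0, 1, 2, 2, 1, 3, 1.0)   # i lies between j and k: sign -1
accumulate(store, 0, 2, 1, 3, 1, 2, 0.5)   # c<b and k<i: signs -1 and +1
accumulate(store, 0, 1, 2, 1, 1, 3, 9.0)   # repeated hole index: skipped
print(store)
```

The signs reproduce the pattern in the pseudocode: a plus sign when i is the smallest or the largest hole index, a minus sign when it lies in between, and an extra minus sign when b and c are exchanged.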

Extensible many-electron theory module in NWChem

• “Extensible” because a new many-electron method can be added relatively easily by TCE

• Very general tensor storage interface (needs only the sizes and offsets of one-dimensional compressed tensor arrays; the offsets of intermediate arrays are also computed at run time by the programs generated by TCE)

• Compatible one- and two-electron integral transformation codes and offset generators

Optimizations

• Spin, spatial, permutation symmetries

• Dynamic tiling (orbital ranges are “tiled” (or blocked) into smaller sections so that the peak local memory usage does not exceed the user-specified limit)

• Dynamic load balancing parallelism (each tile-level tensor contraction is carried out in one processor with virtually no communication)

• Disk I/O is based on Shared File Library of ParSoft, which allows one-sided (independent) read/write without Global Array cache

• Local sorting of array elements (so that the composite summation indices become contiguous in memory) followed by local DGEMM (with absolutely no communication in this critical step)
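Tiling and dynamic load balancing can be pictured with two small helpers. This is a toy model under stated assumptions: tiles are contiguous and bounded only by a maximum size, and load balancing is imitated by handing each task to whichever worker becomes free first. Neither function reflects the actual NWChem implementation, and both names are hypothetical.

```python
def make_tiles(n_orbitals, max_tile):
    """Split range(n_orbitals) into contiguous (offset, size) tiles."""
    if max_tile < 1:
        raise ValueError("max_tile must be positive")
    return [(start, min(max_tile, n_orbitals - start))
            for start in range(0, n_orbitals, max_tile)]


def assign_tasks(task_costs, n_workers):
    """Give each task, in order, to the worker that becomes free first.

    Returns (owner of each task, accumulated busy time of each worker).
    """
    finish = [0.0] * n_workers
    owner = []
    for cost in task_costs:
        worker = finish.index(min(finish))
        owner.append(worker)
        finish[worker] += cost
    return owner, finish


tiles = make_tiles(12, 5)
owner, finish = assign_tasks([3, 1, 1, 1], 2)
print(tiles, owner, finish)
```

Smaller tiles lower the peak memory of each tile-level contraction and produce more, finer tasks to balance, at the price of more loop overhead.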

Previous & new algorithms

Previous algorithm: DRA → GA → MA
• Collective I/O (synchronization!) & GA cache
• GA to MA sort (communications!)

New algorithm: SF → MA → MA
• One-sided I/O (no synchronization!)
• MA to MA sort (no communications!)

Methods available

• Various spin-unrestricted coupled-cluster methods
  • LCCD, CCD, LCCSD, CCSD, CCSDT
  • More to follow (higher CC, CI, MBPT, EOM-CC, etc.)

• Input syntax
  • Uses NWDFT module for the ground state

    dft
      xc Hfexch 1.0
    end

    tce
      ccsd
      thresh 1e-6
      maxiter 100
    end

    task tce energy

A sample output (water CCSD/sto-3g)

    NWChem General Electron-Correlation Theory Module
    -------------------------------------------------
    Programs generated by a Tensor Contraction Engine

    General Information
    -------------------
    Wavefunction type : Restricted
     No. of electrons : 10
      Alpha electrons : 5
       Beta electrons : 5

    /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

    Correlation Information
    -----------------------
      Calculation type : Coupled-cluster singles & doubles (CCSD)
        Max iterations : 100
    Residual threshold : 0.10E-09

    Memory Information
    ------------------
    Available GA+MA space size is 26213624 doubles
    Maximum block size 50 doubles

A sample output (continued)

    Suggested orbital blocking

    Block   Spin    Irrep   Size        Offset
    -----------------------------------------
      1     alpha   a       5 doubles     0
      2     beta    a       5 doubles     5
      3     alpha   a       2 doubles    10
      4     beta    a       2 doubles    12

    /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

    2-e file size = 5443
    2-e file name = ./temp.v2
    Cpu time / sec 0.0

    /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

    t2 file size = 300
    t2 file name = ./temp.t2
    Cpu time / sec 0.0

    MBPT(2) correlation energy = -0.035867246917899 hartree
    MBPT(2) total energy       = -74.998530309066552 hartree
    Cpu time / sec 0.0

A sample output (continued)

    -------------------------------------------------------
    Iter   Residuum            Correlation          Cpu/Sec
    -------------------------------------------------------
      1    0.089123237955088   -0.035867246917899   0.1
      2    0.031759620132034   -0.045406888265697   0.1
      3    0.012682891602275   -0.048387005902666   0.1
      4    0.005383277884425   -0.049437059764660   0.1
      5    0.002395445228466   -0.049839118488995   0.1
      6    0.001110827268269   -0.050002172402908   0.1

    /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

     26    0.000000002031284   -0.050127328255753   0.1
     27    0.000000001066715   -0.050127328323605   0.0
     28    0.000000000560286   -0.050127328359134   0.1
     29    0.000000000294338   -0.050127328377747   0.1
     30    0.000000000154649   -0.050127328387501   0.1
     31    0.000000000081266   -0.050127328392616   0.1
    -------------------------------------------------------
    CC iteration converged
    CCSD correlation energy = -0.050127328392616 hartree
    CCSD total energy       = -75.012790390541269 hartree

    Task times cpu: 2.0s wall: 2.4s

Performance

• Titan spin-adapted parallel CCSD code
  • H2O CCSD/cc-pVTZ, Energy = -0.2850225 hartree

    Nodes   Symmetry   Time (secs/iter)
    1       off        16.8
    1       on         16.6
    2       off         8.2
    2       on          8.3

• Present spin-unrestricted parallel CCSD code
  • H2O CCSD/cc-pVTZ, Energy = -0.2850225 hartree

    Nodes   Symmetry   Time (secs/iter)
    1       off        49.1
    1       on         14.5
    2       off        25.2
    2       on          7.5

Spin-unrestricted code has to deal with 3 times as many t-amplitudes as does spin-adapted code, so theoretically spin-adapted code should be 3 times as fast as spin-unrestricted code.
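The timings quoted on this slide can be turned into symmetry gains and two-node speedups with a few lines of arithmetic. The dictionary keys and helper names below are illustrative; the numbers are the ones reported for H2O CCSD/cc-pVTZ.

```python
# Seconds per CCSD iteration, keyed by (code, nodes, symmetry setting).
timings = {
    ("spin-adapted", 1, "off"): 16.8,
    ("spin-adapted", 1, "on"): 16.6,
    ("spin-adapted", 2, "off"): 8.2,
    ("spin-adapted", 2, "on"): 8.3,
    ("spin-unrestricted", 1, "off"): 49.1,
    ("spin-unrestricted", 1, "on"): 14.5,
    ("spin-unrestricted", 2, "off"): 25.2,
    ("spin-unrestricted", 2, "on"): 7.5,
}


def symmetry_gain(code, nodes):
    """How many times faster an iteration is with symmetry switched on."""
    return timings[(code, nodes, "off")] / timings[(code, nodes, "on")]


def node_speedup(code, sym):
    """Speedup of an iteration when going from 1 node to 2 nodes."""
    return timings[(code, 1, sym)] / timings[(code, 2, sym)]


for code in ("spin-adapted", "spin-unrestricted"):
    for nodes in (1, 2):
        print(code, nodes, "symmetry gain", round(symmetry_gain(code, nodes), 2))
    for sym in ("off", "on"):
        print(code, sym, "2-node speedup", round(node_speedup(code, sym), 2))
```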

Future plans

• CCSDTQ, CI, MBPT, EOM-CC implementation
  • What is the appropriate tensor formulation for MBPT? (Are the MBPT denominators tensors?) See Head-Gordon et al.
  • “Persistent intermediates” (or the so-called similarity-transformed Hamiltonian matrix elements) in EOM-CC

• CC(2)PT(2) implementation
  • Post-CCSD(T) O(n^7) method that includes perturbative quadruples

• Further optimization (loop fusion, more aggressive factorization, space-time tradeoffs, etc.) by computer scientist colleagues

• Modular extensibility of operator contraction engine
  • Active spaces (multi-reference methods)
  • Orbital rotations (atomic-orbital-based or local correlation methods)