
K. Roche, Future Technologies Group

Computer Science and Mathematics Division, Oak Ridge National Laboratory

THE UNEDF COLLABORATION

–COMPUTER SCIENCE RESEARCH AND SOFTWARE DEVELOPMENT

1. Superfluid Local Density Approximation
   Imaginary Time DFT Solver – Magierski, Bulgac
   Time Dependent DFT – Bulgac, Yu

2. Coupled Cluster Tensor Based Products – Dean

Time-Dependent Superfluid Local Density Approximation (TD-SLDA): a generalization of Kohn-Sham LDA to superfluid systems

$Q(\omega) = \int d^3r\, dt\; Q(\vec r, t)\, \rho(\vec r, t)\, e^{i\omega t}$

Features of this extension to DFT:
•All nuclei (odd, even, spherical, deformed)
•Any quantum numbers of QRPA modes
•Fully self-consistent, no self-consistent symmetries imposed

Problem size: lattice of $N_x = N_y = N_z \approx 30\ldots100$ points per dimension; number of quasi-particle wavefunctions $\psi_k(x,t)$ entering the densities $n(x,t)$ is $\sim 10^4\ldots10^5$; storage of $\psi_k(\vec r, t)$ per time step is $O(N_x N_y N_z \times N_{wf})$.

Higgs mode of the pairing field in a homogeneous unitary Fermi gas - Bulgac, Yoon

Observables and spectrum

•Space-time lattice; FFTW used for spatial derivatives (see the sketch after the scaling table below)
•No matrix operations

• C and F90 versions exist and are numerically consistent
• Energy and particle number are conserved in time

• Compiled and run on various platforms
  • Intel Xeon dual processor, quad-core (2X)
  • AMD Opteron single processor, dual-core (quad-core forthcoming)

• e.g., 40^3 on jaguarcnl on 3200 nodes at 58 s / time step
• (1600 PEs) 750e12 INS, 46e12 FLOP / time step

• Scaling behavior
  • max lattice size ~ 66^3 on jaguar ~ (31,328 PEs, 2 GB RAM / PE, 250 TF)

• (per time step) STORAGE ~ O(2^40 BYTE), COMPUTE ~ O(2^40 FLOP)
• (per time step) Petascale problem at Nx = Ny = Nz > 100

• strong: fix the lattice dimensions and add MPI processes, thereby reducing the number of quasi-particle wavefunctions per MPI process

• weak: increase problem complexity and hardware allocation at fixed walltime
• Metric ~ cost($, time) / lattice site / wavefunction / time step / PE

• not ideal – memory bound

Strong scaling on jaguarcnl (e.g. Nx = Ny = Nz = 30):

  PEs (jaguarcnl)    576     1152    1728    2304
  <NWF / PE>          48       24      16      12
  [s] / time step    56.2     28.8    19.3    14.92
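As noted above, spatial derivatives on the periodic lattice are evaluated spectrally with FFTW rather than with difference matrices. Below is a minimal one-dimensional sketch of that idea; the function and variable names are illustrative only, not the production TD-SLDA routines, which work on 3D lattices of quasi-particle wavefunctions.

/* Spectral derivative of a periodic function sampled on nx lattice points. */
#include <complex.h>
#include <fftw3.h>
#include <math.h>

void spectral_derivative(int nx, double lx, fftw_complex *f, fftw_complex *dfdx)
{
    fftw_complex *fk = fftw_malloc(sizeof(fftw_complex) * nx);
    fftw_plan fwd = fftw_plan_dft_1d(nx, f,  fk,   FFTW_FORWARD,  FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_1d(nx, fk, dfdx, FFTW_BACKWARD, FFTW_ESTIMATE);

    fftw_execute(fwd);                       /* f(x) -> f(k) */
    for (int j = 0; j < nx; j++) {
        int m = (j <= nx / 2) ? j : j - nx;  /* wavenumber index folded to [-nx/2, nx/2] */
        if (j == nx / 2) m = 0;              /* drop the unmatched Nyquist mode */
        double k = 2.0 * M_PI * m / lx;
        fk[j] *= I * k / nx;                 /* d/dx in k-space; 1/nx normalizes the FFT pair */
    }
    fftw_execute(bwd);                       /* f'(k) -> f'(x) */

    fftw_destroy_plan(fwd);
    fftw_destroy_plan(bwd);
    fftw_free(fk);
}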

Columns: State | HO SO (N=22) | Spline no SO | Spline SO | Wav no SO (1.e-4) | Wav SO (1.e-2) | 3D-lattice no SO | 3D-lattice SO

 1, ½+    -22.23984   -22.2401   -22.2400   -22.24011   -22.240   -22.2402   -22.2402
 2, ½-    -22.23949   -22.2400   -22.2399   -22.23998   -22.239   -22.2401   -22.2401
 3, ½+     -9.43509    -9.22050   -9.4365    -9.217991   -9.420    -9.22062   -9.43674
 4, 3/2+   -9.42925    -9.21260   -9.4319    -9.21215    -9.416    -9.21271   -9.43214
 5, 3/2-   -9.42911    -9.21129   -9.4310    -9.21176    -9.415    -9.21271   -9.43092
 6, ½-     -9.42490    -9.21116   -9.4278    -9.21168    -9.409    -9.21140   -9.42799
 7, ½+     -8.77589    -9.20595   -8.7782    -9.21077    -8.809    -9.21140   -8.77839
 8, ½-     -8.77013    -9.20585   -8.7737    -9.20989    -8.806    -9.20606   -8.77394
 9, ½+     -1.71593    -1.72467   -1.7239    -1.72499    -1.725    -1.72840   -1.72843
10, ½-     -1.51149    -1.52621   -1.5251    -1.52707    -1.527    -1.52276   -1.52279

•Goal is to compute ground-state energies for all nuclei – a basis for the time-dependent code

•Lattice code with plane wave basis and extensive use of FFT
•Current version is not explicitly parallel
•General loss of orthogonality – must diagonalize / re-orthogonalize

• ~75% of walltime in evolution; ~25% of walltime in diagonalization

•Parallel discretization will be the same as for the time-dependent SLDA lattice code

Lattice DFT Solver (Imaginary Time Evolution) – SLDA

Computed Energies: Harmonic Oscillator, Spline, Multiwavelets, and 3D-lattice

Coupled Cluster

•ab initio technique

•calculates ground-state properties of closed-shell (or sub-shell) nuclei

•solves coupled nonlinear sets of equations (largest ~10M unknowns)

•CCSD ~ O(nu^2 * No^4) (nu:=unoccupied orbitals; No:=occupied)

•CCSDT ~ O(nu^3 * No^5)

WHERE i, j, k, l index the n occupied orbitals AND a, b, c, d index the N – n unoccupied orbitals:

$\tilde{T}^{ab}_{ij} \;=\; \sum_{k,l=1}^{n} \sum_{c,d=n+1}^{N} \langle kl\,|\,cd \rangle\, T^{ac}_{ik}\, T^{db}_{lj}, \qquad i,j \in [1,n], \quad a,b \in [n+1,N]$

GOAL: n ~ 100, N ~ 1000
O(2^40 BYTES) – not memory bound
O(1E13 FLOP) – CCSD
O(1E18 FLOP) – CCSDT

sum of wall time: 449.967129 sum of cpu time: 561.210000

•Irregular data access patterns for copying / computation

•Forced network / disk usage because of non-local terms

•Products of rectangular arrays don’t perform like square ones

•Performance is a mixture of BLAS1, BLAS2, BLAS3

•Large variance per term

HARD PROBLEM

•Tensor Contraction Engine (TCE)

•Rely on combination of local disk and communication network

•Cannon’s algorithm, replicated data blocks , accumulation via all-all reduction over partial sums

•Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems (Aggregate Remote Memory Copy Interface, Global Arrays)

•Shared and remote-memory based universal matrix multiplication algorithm

•Co-array Fortran

Related (current and past) Computer Science Efforts

(Coupled Cluster cont.)

Software Dependencies

•OPERATING SYSTEMS
  •UNIX (Compute Node Linux, Cray)

•LANGUAGES / COMPILERS
  •C, Fortran90
  •GCC (3.*, 4.*)
  •Portland Group (7.*)
  •Intel (9.*, 10.*)

•LIBRARIES
  •MPI 1, 2 (MPICH, Cray, Intel, OpenMPI)
  •POSIX Threads
  •BLAS (ATLAS, GOTO, LIBSCI, ACML, MKL)
  •LAPACK (ATLAS, LIBSCI, ACML, MKL)
  •ScaLAPACK (LIBSCI, CMKL)
  •FFTW (LIBSCI, ACML, MKL)
  •LUSTRE

•ISO/IEC JTC1/SC22 - Programming languages and operating systems

•C99 + TC1 + TC2, WG14 N1124, dated 2005-05-06
  http://www.open-std.org/JTC1/SC22/WG14/www/standards.html

•F03 (08): P1 (base) + P2 (variable length strings) + P3 (conditional compilation)
  http://www.nag.co.uk/sc22wg5
  http://www.co-array.org/caf_intro.htm (co-arrays – remote memory ops)

•POSIX Threads: IEEE Std 1003.1, 2004 Edition; ISO/IEC standard 9945-1:1996
  http://www.unix.org/single_unix_specification

BONUS SLIDES

How to Speed Up Relevant SLDA Computations by Exploiting Lightweight Processes (PTHREADS) on Multicore Nodes

•Partial densities – e.g., local summations (a pthreads sketch follows this list)

•Strong scaling of computations to number of processor cores

•Discrete Fourier Transforms

•Multithreaded versions for transpose and multiplication

•Communication and I/O (MPI_Init_thread())

•swap quasi-particle wavefunctions to / from network disk / memory

•overlap of reduction and computation
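A minimal sketch of the partial-density item above, assuming the simple form n(x) = sum_k |v_k(x)|^2 and splitting the local sum over quasi-particle wavefunctions across POSIX threads. The names, data layout, and density formula are illustrative placeholders, not the production routines.

/* Each thread accumulates the contribution of a slice of the local
   quasi-particle wavefunctions to the number density n(x). */
#include <pthread.h>
#include <complex.h>
#include <stdlib.h>

typedef struct {
    int nxyz, k0, k1;            /* lattice volume and wavefunction slice [k0,k1) */
    double complex **v;          /* v[k][ix]: lower Bogoliubov components */
    double *n_partial;           /* this thread's partial density, length nxyz */
} density_task;

static void *density_worker(void *arg)
{
    density_task *t = arg;
    for (int k = t->k0; k < t->k1; k++)
        for (int ix = 0; ix < t->nxyz; ix++)
            t->n_partial[ix] += creal(t->v[k][ix] * conj(t->v[k][ix]));
    return NULL;
}

/* Split nwf wavefunctions over nthreads, then fold the per-thread partials
   into n(x); the node-local result is later MPI-reduced across processes. */
void local_density(int nthreads, int nwf, int nxyz, double complex **v, double *n)
{
    pthread_t tid[nthreads];
    density_task task[nthreads];
    for (int t = 0; t < nthreads; t++) {
        task[t] = (density_task){ nxyz, t * nwf / nthreads, (t + 1) * nwf / nthreads,
                                  v, calloc(nxyz, sizeof(double)) };
        pthread_create(&tid[t], NULL, density_worker, &task[t]);
    }
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        for (int ix = 0; ix < nxyz; ix++) n[ix] += task[t].n_partial[ix];
        free(task[t].n_partial);
    }
}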

•Construct macro indices I = {k,c} , J = {d,l} , K = {b,j} , L= {a,i}

•Expensive copy

•Q(J,I) , P(K,J) , S(I,L) are the results

•Make the operation look like a sequence of matrix multiplications (see the BLAS sketch below)

•Compute P(K,J) * Q(J,I) = R(K,I)

•Compute R(K,I) * S(I,L) = W(K,L)

• Store W(K,L) as $\tilde{T}^{ab}_{ij}$

Copy that builds the macro-indexed matrix A(I,J) from the rank-four object t(a,b,i,j):

I = 0
do a = 1, N
   do b = 1, N
      I = I + 1
      J = 0
      do i = 1, n
         do j = 1, n
            J = J + 1
            A(I,J) = t(a,b,i,j)
         end do
      end do
   end do
end do

Coupled Cluster: Macro-index construction from rank four object
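Once the macro-indexed matrices are formed, the two products reduce to ordinary GEMM calls. A hedged sketch against the CBLAS interface follows; the actual code may call Fortran BLAS or PBLAS directly, and dimI, dimJ, dimK, dimL here simply denote the lengths of the macro indices.

/* Contract the rank-4 objects as two ordinary matrix products:
   P(K,J) * Q(J,I) = R(K,I), then R(K,I) * S(I,L) = W(K,L). */
#include <cblas.h>
#include <stdlib.h>

void contract(int dimK, int dimJ, int dimI, int dimL,
              const double *P, const double *Q, const double *S, double *W)
{
    double *R = malloc(sizeof(double) * (size_t)dimK * dimI);

    /* R = P * Q : (dimK x dimJ) * (dimJ x dimI), row-major storage assumed */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                dimK, dimI, dimJ, 1.0, P, dimJ, Q, dimI, 0.0, R, dimI);

    /* W = R * S : (dimK x dimI) * (dimI x dimL) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                dimK, dimL, dimI, 1.0, R, dimI, S, dimL, 0.0, W, dimL);

    free(R);
}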

[rochekj@athena0 ~]$ cat /share/data1/rochekj/2111149.dat
8 32768 32768
/share/data1/rochekj/2111149.dat.0 0 1073741823
/share/data1/rochekj/2111149.dat.1 1073741824 2147483647
/share/data1/rochekj/2111149.dat.2 2147483648 3221225471
/share/data1/rochekj/2111149.dat.3 3221225472 4294967295
/share/data1/rochekj/2111149.dat.4 4294967296 5368709119
/share/data1/rochekj/2111149.dat.5 5368709120 6442450943
/share/data1/rochekj/2111149.dat.6 6442450944 7516192767
/share/data1/rochekj/2111149.dat.7 7516192768 8589934591
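The listings suggest a metadata file whose first line holds the element size in bytes followed by the object dimensions (32768 x 32768 x 8 bytes matches the last end byte above), and whose remaining lines each name a chunk file with its inclusive global byte range. A small illustrative parser under that reading of the format; the struct and function names are assumptions, not the actual library API:

#include <stdio.h>
#include <stdlib.h>

typedef struct { char path[256]; long long sb, eb; } chunk_t;

/* Read element size, ndims dimensions, then chunk records until EOF.
   Returns the number of chunk files; caller frees *chunks. */
int read_meta(const char *meta, int ndims, long long *esize, long long *dims,
              chunk_t **chunks)
{
    FILE *fp = fopen(meta, "r");
    if (!fp) return -1;
    if (fscanf(fp, "%lld", esize) != 1) { fclose(fp); return -1; }
    for (int d = 0; d < ndims; d++)
        if (fscanf(fp, "%lld", &dims[d]) != 1) { fclose(fp); return -1; }

    int n = 0, cap = 16;
    chunk_t *c = malloc(cap * sizeof(chunk_t));
    while (fscanf(fp, "%255s %lld %lld", c[n].path, &c[n].sb, &c[n].eb) == 3)
        if (++n == cap) c = realloc(c, (cap *= 2) * sizeof(chunk_t));

    fclose(fp);
    *chunks = c;
    return n;
}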

[Figure: matrices distributed over the six processes (ranks 0-5) of MPI_COMM_WORLD, arranged as a 2 x 3 grid.]

8 100 100 400 400
./31243.dat.0 0 1073741823
./31243.dat.1 1073741824 2147483647
./31243.dat.2 2147483648 3221225471
./31243.dat.3 3221225472 4294967295
./31243.dat.4 4294967296 5368709119
./31243.dat.5 5368709120 6442450943
./31243.dat.6 6442450944 7516192767
./31243.dat.7 7516192768 8589934591
./31243.dat.8 8589934592 9663676415
./31243.dat.9 9663676416 10737418239
./31243.dat.10 10737418240 11811160063
./31243.dat.11 11811160064 12799999999

Higher Rank Objects

[Figure: virtual rectangular process grid (ranks 0-5); subgroup formation (ranks 0, 1, 2); multithreaded range retrieval; subgroup Bcast and parse.]

*Replace PATH with a struct holding the MPI id and a pointer

Disk and network memory (MPI) based arbitrary rank object storage and manipulation: data structure explorations

[rochekj@athena0 kr-pt]$ ./xkfil_rd 4 /share/data1/rochekj/2111149.dat 0
/share/data1/rochekj/2111149.dat: wall_t: 0.009792 cpu_t: 0.000000
dim[0] = 32768
dim[1] = 32768
[0] myoff=0 sb=0 eb=2147483647 tb=2147483648
[1] myoff=268435456 sb=2147483648 eb=4294967295 tb=2147483648
[2] myoff=536870912 sb=4294967296 eb=6442450943 tb=2147483648
[3] myoff=805306368 sb=6442450944 eb=8589934591 tb=2147483648
[0 / 4]: read BYTES := 2147483648
[1 / 4]: read BYTES := 2147483648
[2 / 4]: read BYTES := 2147483648
[3 / 4]: read BYTES := 2147483648
BYTES(kfil_rd_r) : 8589934592 t[s]: 33.940000 241.4MBps
[rochekj@athena0 kr-pt]$

[rochekj@athena0 kr-pt]$ ./xkfil_rd 1 /share/data1/rochekj/2111149.dat 0
/share/data1/rochekj/2111149.dat: wall_t: 0.009448 cpu_t: 0.000000
dim[0] = 32768
dim[1] = 32768
[0 / 1]: read BYTES := 8589934592
BYTES(kfil_rd_r) : 8589934592 t[s]: 45.968010 178.2MBps
[rochekj@athena0 kr-pt]$

Examples: kfil_rd_r(), kfil_rd_panel_r() – as needed for PBLAS- and ScaLAPACK-based operations

Runs shown: 1 process versus 4 pthreads spawned from within process 0 – multithreaded range retrieval (sketch below).
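A minimal sketch of the multithreaded range retrieval behind kfil_rd_r()-style routines: each thread pread()s its assigned byte range of an already-opened chunk file into its offset of a shared buffer. Names, byte-based offsets, and the error handling are simplified placeholders.

#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>

typedef struct {
    int fd;          /* chunk file, already opened with open(path, O_RDONLY) */
    char *buf;       /* this thread's destination in the shared buffer */
    off_t off;       /* starting byte of this thread's range in the file */
    size_t nbytes;   /* length of this thread's range */
} range_task;

static void *range_worker(void *arg)
{
    range_task *t = arg;
    size_t done = 0;
    while (done < t->nbytes) {                       /* pread may return short counts */
        ssize_t r = pread(t->fd, t->buf + done, t->nbytes - done, t->off + (off_t)done);
        if (r <= 0) break;                           /* error handling elided in sketch */
        done += (size_t)r;
    }
    return NULL;
}

/* Split 'total' bytes of one file over nthreads concurrent readers. */
void kfil_rd_sketch(int fd, char *buf, size_t total, int nthreads)
{
    pthread_t tid[nthreads];
    range_task task[nthreads];
    size_t chunk = total / nthreads;
    for (int t = 0; t < nthreads; t++) {
        task[t] = (range_task){ fd, buf + t * chunk, (off_t)(t * chunk),
                                (t == nthreads - 1) ? total - t * chunk : chunk };
        pthread_create(&tid[t], NULL, range_worker, &task[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}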

Multi-threaded Asynchronous List Construction and Control in MPI Space

•Useful to both slda and cc software efforts

•Based on the producer-driven bounded buffer problem
•Client / server based producer-consumer model

•MPI_Init_thread() with MPI_THREAD_{SINGLE, FUNNELED, SERIALIZED, MULTIPLE} (see the sketch below)

•Discuss benefits here …
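A hedged sketch of the producer/consumer bounded buffer together with the MPI_Init_thread() call. The buffer holds plain integers standing in for work-list entries, and thread creation is elided; it is an illustration of the pattern, not the SLDA/CC implementation.

#include <mpi.h>
#include <pthread.h>

#define BUF_SLOTS 8

typedef struct {
    int items[BUF_SLOTS];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_full, not_empty;
} bounded_buffer;

static bounded_buffer bb = { .lock      = PTHREAD_MUTEX_INITIALIZER,
                             .not_full  = PTHREAD_COND_INITIALIZER,
                             .not_empty = PTHREAD_COND_INITIALIZER };

static void bb_put(bounded_buffer *b, int item)      /* producer side */
{
    pthread_mutex_lock(&b->lock);
    while (b->count == BUF_SLOTS) pthread_cond_wait(&b->not_full, &b->lock);
    b->items[b->tail] = item; b->tail = (b->tail + 1) % BUF_SLOTS; b->count++;
    pthread_cond_signal(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
}

static int bb_get(bounded_buffer *b)                 /* consumer side */
{
    pthread_mutex_lock(&b->lock);
    while (b->count == 0) pthread_cond_wait(&b->not_empty, &b->lock);
    int item = b->items[b->head]; b->head = (b->head + 1) % BUF_SLOTS; b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->lock);
    return item;
}

int main(int argc, char **argv)
{
    int provided;
    /* MPI_THREAD_MULTIPLE lets any thread make MPI calls, e.g. a producer
       thread receiving remote work requests while consumer threads compute. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) { MPI_Finalize(); return 1; }

    bb_put(&bb, 0);              /* trivial self-test: one item in ...        */
    (void)bb_get(&bb);           /* ... and out; real code spawns the threads */

    MPI_Finalize();
    return 0;
}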

Memory:

-Theoretical peak: (bus width) * (bus speed) = 6 GBytes/second

Computation:

-Theoretical peak: (# cores) * (flops/cycle/core) * (cycles/second)

BLAS remain one tool of choice but Moore’s Law still lingers …

y = a*x + y : 3 operands (24 bytes) are needed for 2 flops
(3 operands / 2 flop) * (8 bytes/operand) * 2 cores * 2 (flop/cyc/core) * 2.6e9 (cyc/sec) requires ~125 GBytes per second; we have 6 GBps. How to get sustainability?

BLAS 1: O(n) operations on O(n) operands
BLAS 2: O(n**2) operations on O(n**2) operands
BLAS 3: O(n**3) operations on O(n**2) operands

BE AWARE AND LEARN FROM PAST EXPERIENCES!

1. Form group G1 from MPI_COMM_WORLD
2. Form group G2 := outliers (feature extraction)
3. Form group G3 = G1 \ G2 and COMM3 (work group and communicator)
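A minimal sketch of these three steps with standard MPI group calls; the outlier rank list is a placeholder for the result of the feature extraction.

#include <mpi.h>

MPI_Comm make_work_comm(MPI_Comm world, const int *outliers, int noutliers)
{
    MPI_Group g1, g2, g3;
    MPI_Comm comm3;

    MPI_Comm_group(world, &g1);                     /* 1. G1 from MPI_COMM_WORLD   */
    MPI_Group_incl(g1, noutliers, outliers, &g2);   /* 2. G2 := the outlier ranks  */
    MPI_Group_difference(g1, g2, &g3);              /* 3. G3 = G1 \ G2             */
    MPI_Comm_create(world, g3, &comm3);             /*    COMM3 over the work group */

    MPI_Group_free(&g1); MPI_Group_free(&g2); MPI_Group_free(&g3);
    return comm3;   /* MPI_COMM_NULL on ranks that belong to G2 */
}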

Evolve the lowest NWF states in imaginary time on the lattice ($x \in [1, N_x]$ per dimension, $dx = L_x / N_x$):

$\psi_n(x, \tau + \Delta\tau) = e^{-\Delta\tau H}\, \psi_n(x, \tau), \qquad n = 1, \ldots, NWF$

with $H = T + V(x) + V_{SO}$ (kinetic, mean-field, and spin-orbit terms) and the second-order operator splitting

$e^{-\Delta\tau H} \approx e^{-\Delta\tau T/2}\, e^{-\Delta\tau (V + V_{SO})}\, e^{-\Delta\tau T/2} + O(\Delta\tau^3).$

The NWF x NWF overlap matrix is periodically diagonalized to restore orthogonality of the $\psi_n$, and the cycle is repeated until convergence; the number of iterations Niter grows with the requested accuracy of H.

Imaginary Time DFT Solver

$\Delta(x) = g_{\rm eff}(x)\, \nu(x)$

The renormalized energy density functional


$\psi_k \sim [u_k(x), v_k(x)]$ : Bogoliubov quasi-particle wavefunction with index label k

•Computation is parallelized over the index label k, defined on a periodic cubic lattice of dimension Nx**3 in a plane wave basis

•Each process has a copy of the lattices (x,k)

•Partial densities are computed locally, then reduced and broadcast (see the MPI_Allreduce sketch after this list)

•Computation of gradients is required for this step

•Fixing the number density of the homogeneous gas assigns the energy per particle in k-space, makes the chemical potential and the pairing gap computable, and leads to the parameters in the energy functional

•We can now evaluate the energy functional and begin time evolution
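A minimal sketch of the reduce-and-broadcast step, assuming the partial density is held as a flat array of lattice values; MPI_Allreduce combines the reduction and the broadcast in one collective call. Names are illustrative.

#include <mpi.h>

/* Every process has summed its own wavefunctions into n_partial(x);
   after the call, every process holds the full density n(x). */
void reduce_density(const double *n_partial, double *n, int nxyz, MPI_Comm comm)
{
    MPI_Allreduce(n_partial, n, nxyz, MPI_DOUBLE, MPI_SUM, comm);
}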

Adams-Bashforth (Moulton): predictor, corrector, modifier

•After computation of p(n+1) and m(n+1), the kinetic energy density must be updated: we must compute time derivatives and gradients of the quasi-particle wavefunctions and subsequently the required densities

•Next, c(n+1) and phi(n+1) can be computed and used to update the densities again (as in the previous step)

•The energy functional and total particle number are evaluated, time is bumped, and the values of the quasi-particle wavefunctions at different time indices are shifted (copied in memory); a generic sketch of the predictor-corrector step follows
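The slides do not spell out the multistep coefficients, so the following is a generic sketch of a fourth-order Adams-Bashforth predictor with an Adams-Moulton corrector (the modifier step is omitted) for dy/dt = f(t, y); in the solver, y stands for the full set of quasi-particle wavefunctions and densities, and f for the action of the TD-SLDA Hamiltonian.

/* One generic ABM predictor/corrector step for n unknowns.
   f[0..3] hold f(t_n), f(t_n-1), f(t_n-2), f(t_n-3). */
void abm_step(int n, double dt, double t, const double *y, double *f[4],
              double *y_pred, double *y_new,
              void (*rhs)(double t, const double *y, double *f))
{
    /* Predictor: 4th-order Adams-Bashforth */
    for (int i = 0; i < n; i++)
        y_pred[i] = y[i] + dt / 24.0 *
            (55.0 * f[0][i] - 59.0 * f[1][i] + 37.0 * f[2][i] - 9.0 * f[3][i]);

    /* Evaluate the right-hand side at the predicted point (t + dt),
       reusing the oldest history slot, which is no longer needed. */
    double *f_pred = f[3];
    rhs(t + dt, y_pred, f_pred);

    /* Corrector: 3-step Adams-Moulton (4th order) */
    for (int i = 0; i < n; i++)
        y_new[i] = y[i] + dt / 24.0 *
            (9.0 * f_pred[i] + 19.0 * f[0][i] - 5.0 * f[1][i] + 1.0 * f[2][i]);

    /* Shift the history, mirroring the "values ... are shifted (copied in
       memory)" bullet, and refresh f at the corrected solution. */
    double *oldest = f[3];
    f[3] = f[2]; f[2] = f[1]; f[1] = f[0]; f[0] = oldest;
    rhs(t + dt, y_new, f[0]);
}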