Large-Scale GW Calculations on Pre-Exascale HPC Systems Archive/tech_poster/poster_files... ·...

Large-Scale GW Calculations on Pre-Exascale HPC SystemsBerkeleyGW: Method Developments and Code Optimization

M. Del Ben1, F.H. da Jornada1,2, A. Canning1, N. Wichmann3, K. Raman4, R. Sasanka4, C. Yang1, S.G. Louie1,2 and J. Deslippe1

1Lawrence Berkeley National Laboratory, 2University of California at Berkeley, 3CRAY, 4Intel Corporation

The GW Method

Solve Dyson’s equation: [−1

2∇2 + Vloc + Σ(En)

]φn = Enφn, (1)

Σ(En) → self-energy (non-Hermitian, non-local, energy-dependent operator)

I Perturbative expansion on the screened Coulomb interaction W

I First approximation Σ = iGW

I W obtained from Inverse Dielectric Matrix ε−1 of the system

In BerkeleyGW [1]:

I Epsilon code → Compute ε−1

I Sigma code → Compute W from ε−1 and solve eq.1

Epsilon code: Inverse Dielectric Matrix ε−1

Input: ψmk, εmk, {q-points}, {ωi}1. Calculate plane-waves matrix elements (FFT’s): O(NvNcNG logNG)

MGjak(q) = 〈ψjk+q| e i(G+q)·r |ψak〉

2. Calculate RPA polarizability (Matrix-Multiplication): O(NωNvNcN2G)

χ(q, ωi) = M(q)†∆jak(εjk, εak,q, ωi)M(q)

∆ diagonal matrix containing the frequency dependence

3. Dielectric matrix ε and inversion: O(NωN3G)

εGG ′(q, ωi) = δGG ′ − v(q + G)χGG ′(q, ωi)

ε−1(q, ωi) = (I − vχ(q, ωi))−1

Parameters: Nv number of valence bands, Nc number of conduction (empty) bands, NG PW basis set size, Nω number of frequencies.

Static Subspace Approximation [2]: Speed-Up the Calculation of ε−1(ωi) for ωi 6= 0

I For ωi = 0: Standard calculation of χ̄(0) = v12χ(0)v

χ(0) = M†∆jak(0)M

Eigendecomposition χ̄(0) = C0†xC0, define C0s according to threshold teigen

I Projection of M into the subspace spanned by C0s

M0s = Mv

I For ωi 6= 0: direct computation of χ̄s(ωi)

χ̄s(ωi) = M0s

†∆jak(ωi)M

I Final evaluation of ε−1(ωi) from χ̄s(ωi)

Execution Memory

Matrix Element O(NvNcNG logNG) O(NvNcNG)

Polarizability ω = 0 O(NvNcN2G) O(N2

Eigendecomposition: C0s O(N3

G) O(N2G)

M0s O(NbNvNcNG) O(NvNcNb)

Polarizability ω 6= 0 O(NωNvNcN2b) O(NωN

Inversion O(NωN3b) O(NωN

I/O O(NGNb + NωN2b) O(NGNb + NωN

Evaluation of ε−1(ωi) O(NωNbN2G) O(NωN

References

1. J. Deslippe, G. Samsonidze, D. A. Strubbe, M. Jain, M. L. Cohen, and S. G. Louie, Comput. Phys. Commun. 183, 1269 (2012).

2. M. Govoni and G. Galli, J. Chem. Theory Comput. 11, 2680 (2015) ; T.A. Pham, H.V. Nguyen, D. Rocca, and G. Galli, Phys. Rev. B 87,

155148 (2013) ; H.-V. Nguyen, T. A. Pham, D. Rocca, and G. Galli, Phys. Rev. B 85, 081101 (2012) ; H. F. Wilson, D. Lu, F. Gygi, and

G. Galli, Phys. Rev. B 79, 245106 (2009) ; H. F. Wilson, F. Gygi, and G. Galli, Phys. Rev. B 78, 113303 (2008).

3. P. Umari, Geoffrey Stenuit, and Stefano Baroni, Phys. Rev. B 79, 201104-R (2009)

4. M. Del Ben, F.H. da Jornada, A. Canning, N. Wichmann, K. Raman, R. Sasanka, C. Yang, S.G. Louie and J. Deslippe, In Preparation

Benchmark Calculations

Silicon Carbide (β−SiC):

(a) (b) (c)

Figure: (a) band structure as obtained without (solid blue) and with the static subspace approximation (red dottedteig = 0.01); (b) mean absolute error between the reference and approximate results (196 EQP calculated); (c)percentage reduction in time to solution (red) and number of eigenvectors (blue) for the evaluation of ε−1.

Single Vacancy V 01 (1000-Si Atoms Benchmark)

∆E v ∆E g ∆E c ∆E Si #Nodes Time (min)

LDA 0.29 0.07 0.23 0.60 - -

G0W0 (6Ry) 0.41 0.27 0.46 1.15 480 73

G0W0 (12Ry) 0.42 0.26 0.49 1.18 2048 157

Energies in eV

G0W0 = Full-Frequency Contour Deformation

Calculations Performed on Edison@NERSC (CRAY-XC30)

I Approximation tested for Insulators, Semiconductors, Metals, Clusters, Slabs

I Direct correlation between accuracy and teigen (5-25% of the eigenstates →∼ 1 meV accuracy)

Sigma code: Calculate Σlm(ω) and Solve Dyson’s Equation

1. Calculate plane-waves matrix elements (FFT’s): O(NΣNnNG logNG)

M−Gnm = 〈φn| e−iG·r |φm〉2. For any given pair of orbital functions {φl , φm} calculate:

Σlm(E ) =i

∫ ∞0

dω∑n

∑GG ′

M−Gnl

ε−1GG ′(ω) · v(G ′)

E − En − ωM−G

Matrix-Multiplication + 2D Dot-Product O(NΣNωNnN2G)

Parallelization Strategy

Σlm(ω) matrix elements distributed over Pools of processes. For each Pool:

Data distribution layout

NG ' 105 ; Nn ' 104 ; NE ' 102

I Left Matrix: ε−1(ω) distributed over rows (NG ×Nω combined index)

I Right Matrix: MGnl distributed over columns (FFT performed locally)

I At each cycle the matrix contraction is performed locally (ZGEMM)

I Ring communication restricted to contiguous MPI tasks

Non-blocking cyclic communication layout

I Communication can be performed over blocks of processes:I Nn′′ size roughly constant independent on the ration OMP×MPII Good ZGEMM performance independent on the number of MPI

tasks employed

Same algorithm is used in the low-rank approximation case: matrix size NG → Nb

I Speed-up proportional to (NG/Nb)2

I Good performance achieved by using smaller pool sizes

Performance Measurement on Cori-KNL@NERSC

Systems: Divacancy states in Silicon supercells containing 998 and 1726 atoms.

I FLOPs per node

I Best/worst performing node

I Strong Scaling

I Time to solution

I Comparison to peak performance

Single Pool Performance: 200 KNL Nodes

(a) (b) (c)

(a) Poor performance due to small Comput./Commun. ratio (different OMPxMPI ratio, nomessage blocking) (b) Improved performance for 64-OMPxMPI by using different processes blocksize (c) Different OMPxMPI ratio and process block size adjusted to give roughly constant n′′.

Strong Scaling: Individual Pool and Full Sigma

(d) (e)

(a) Individual Pool scaling, total execution (red squares) and computationally intense part (bluecircles), (b) Full Sigma, 200-NKL nodes per Pool.

Full Sigma: Best Performance

998 Si 1726 Si

Number KNL Nodes 9600 9500

Number of Cores 633,600 627,000

Number Eqp Evaluated 48 38

Time to solution (s) 160 201

PetaFLOP/s 11.8 11.3

% Peak Performance 47 46

I Sigma kernel capable to scale to full-Cori

I Sigma achieves high fraction of peak performance

I Excellent time to solution (∼100 seconds) for systems made of thousands of atoms

Acknowledgments

This work was supported by the Center for Computational Study of Excited-State Phenomena inEnergy Materials at the Lawrence Berkeley National Laboratory, which is funded by the U.S.Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences andEngineering Division under Contract No. DE-AC02-05CH11231, as part of the ComputationalMaterials Sciences Program.

mdelben@lbl.gov http://www.berkeleygw.org/

Large-Scale GW Calculations on Pre-Exascale HPC Systems Archive/tech_poster/poster_files... ·...

Documents

FUJITSU HPC: The Next Step Toward Exascale

Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

Programming Models for Exascale Systemsnowlab.cse.ohio-state.edu/.../slide/hpcac_lugano14_pmodels.pdf · HPC Advisory Council Switzerland Conference, Apr '14 • High Performance

HPC Storage - DDN.com · PDF file• HPC Storage • September 2015 Lustre Towards ExaScaler. DDN/Intel Webinar. Content • Exascale Storage Challenges (Intel) ... – DDN Automated

OpenCL – OpenACC & Exascale - CEA - HPC · • Exascale architectures may be o Massively parallel o Heterogeneous compute units o Hierarchical memory systems o Unreliable o Asynchronous

Paving The Road to Exascale Computing€特別講演】Todd Wilde.pdfPaving The Road to Exascale Computing HPC Technology Update Todd Wilde, Director of Technical Computing and HPC

Nu exas EXACT - Exascale Projectsexascale-projects.eu/EuroExaFinalBrochure_v1.0.pdf · 3 European Exascale Projects – A community is born High Performance Computing (HPC) is at

The Race Towards Exascale - HPC Advisory Council · Exascale will be Enabled via Co-Design Architecture Software –Hardware Hardware –Hardware (e.g. GPU-Direct) Software –Software

Exascale: Power, Cooling, Reliability, and Future Arithmetic John Gustafson HPC User Forum Seattle, September 2010

Parallel Software usage on UK National HPC Facilities 2009 ...archer.ac.uk/documentation/white-papers/app-usage/UKParallel... · exascale HPC facilities is the ability of parallel

Seagate Exascale HPC Storage

R&D work on pre exascale HPC systems

Designing Efficient HPC, Big Data, Deep Learning, and for ... · 11/2/2017 · Designing Efficient HPC, Big Data, Deep Learning, and Cloud Computing Middleware for Exascale Systems

HPC Strategy AngV2 - Sandia National Laboratories · 2014. 7. 7. · Sandiaʼs HPC Hardware Architecture Strategy Development of our foundation for Exascale August 30, 2011 James

Contain This, Unleashing Docker for HPC · Contain This, Unleashing Docker for HPC 1 ... the"power"of"Docker" ScienceatScale%% % Petascale(to(Exascale(Simula0ons(SciencethroughVolume%%

Soft Errors, Silent Data Corruption, and Exascale Computing · 2018-02-27 · Soft Errors, Silent Data Corruption, and Exascale Computing Sarah Michalak ... Some HPC platforms are

The Challenges of Exascale · 6 HPC in 2018-2020 • Exascale (1018) systems arrive ♦ Issues include power, concurrency, fault resilience, memory capacity • Likely features ♦

Pathways to Convergence - European eXtreme Data and ... · BDEC = Big Data & Exascale Computing. European HPC Technology Projects 3 • Successor to the IESP (International Exascale

Https://portal.futuregrid.org Scientific Computing Supported by Clouds, Grids and HPC(Exascale) Systems June 24 2012 HPC 2012 Cetraro, Italy Geoffrey Fox

H2020-FETHPC-3-2017 - Exascale HPC ecosystem development ... · ETP4HPC. The project promotes global community structuring and synchronization in domains such as HPC, Big Data, cloud