
Page 1

Frank Giraldo

Department of Applied Mathematics

Naval Postgraduate School

Monterey CA 93943

[email protected]

http://frankgiraldo.wix.com/mysite

Lessons Learned on the Development of a Flexible Software Infrastructure for ESMs for Modern Computing Architectures


Page 2

Acknowledgements

Funded by ONR, AFOSR, NSF, and DOE.

Parts of this talk draw on discussions with:

Jeremy Kozdon and Lucas Wilcox (NPS)

Michal Kopera (Boise State)

Simone Marras (NJIT)

Daniel Abdi (TempoQuest)

Emil Constantinescu (Argonne National Lab)

Andreas Mueller (ECMWF)


Page 3

Talk Summary

Caveats

Lessons Learned via Development of GNuME

Summary of Lessons Learned

Where to next


Page 4

Caveats

This talk represents opinions based on my own experience building unstructured/adaptive fluid dynamics solvers (3 big codes so far).

I am an applied mathematician, not a software engineer, although I have learned some best practices.

I am programming-language agnostic: I use what I need, although I have mainly used Fortran.

Lessons learned will be applied to a new collaboration with Caltech, MIT, and JPL on data-driven ESMs.


Page 5

Talk Summary

Caveats

Lessons Learned via Development of GNuME

Summary of Lessons Learned

Where to next


Page 6

What is GNuME

GNuME (Galerkin Numerical Modeling Environment) is a framework for developing flow solvers. GNuME contains a suite of high-order spatial discretization methods (CG/DG), time-integrators, and adaptive mesh refinement, with an emphasis on targeting current and future HPC architectures. GNuME currently houses three components:

1. NUMA[1]: nonhydrostatic (deep-planet) atmospheric model (global/limited-area)

2. NUMO: nonhydrostatic ocean model (limited-area)

3. Shallow water: coastal and global, with wetting and drying

GNuME is written in modern Fortran and C. Fortran is used primarily; C is used for interfacing with C libraries and for the many-core compute kernels.

GNuME emerged from the desire to unify all the codes in my group under one umbrella.

[1] NUMA is used in the U.S. Navy's NEPTUNE NWP system.

Page 7

GNuME Components

[Block diagram of the GNuME components:]

NUMA: global or local compressible Navier-Stokes equation solver
NUMO: incompressible Navier-Stokes equation solver
SWE: shallow water equation solver
CG/DG element-based Galerkin methods
Time-integrators
Elliptic solvers/preconditioners
p4est[1]: adaptive mesh manager
MPI
OCCA[2]

[1] Burstedde et al., SIAM J. Sci. Comput. 5 (2015)
[2] Medina et al., arXiv:1403.0968 (2014)

Page 8

GNuME Components: Some Lessons

Separating the core (red) components allowed us to unify all 10 codes for maintainability (e.g., various PDEs, 2D, 3D, CG, DG, etc.).

The communicators (brown) are radically different: MPI is quite general (array-based), while many-core is quite specific (e.g., Structure of Arrays -> Array of Structures; see the sketch at the end of this slide).

The solvers (blue) are independent, and users can add new components safely. Ideally, each should have its own subdirectory and a distinct build.
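To make the Structure-of-Arrays vs. Array-of-Structures distinction concrete, here is a minimal C sketch; NPOIN, the field names, and the repacking routine are illustrative assumptions, not GNuME code.

/* Illustrative sketch (not GNuME source): the same nodal state stored as a
 * Structure of Arrays (natural for array-based MPI communication) versus an
 * Array of Structures (the layout the many-core kernels consume). */
#define NPOIN 1024   /* hypothetical number of grid points */

typedef struct {                 /* SoA: each field is contiguous in memory */
    double rho[NPOIN];
    double u[NPOIN], v[NPOIN], w[NPOIN];
    double theta[NPOIN];
} StateSoA;

typedef struct {                 /* AoS: all fields of one point are contiguous */
    double rho, u, v, w, theta;
} PointState;

/* Repacking step of the kind a many-core communicator layer would perform. */
void soa_to_aos(const StateSoA *in, PointState out[NPOIN])
{
    for (int i = 0; i < NPOIN; ++i) {
        out[i].rho   = in->rho[i];
        out[i].u     = in->u[i];
        out[i].v     = in->v[i];
        out[i].w     = in->w[i];
        out[i].theta = in->theta[i];
    }
}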

Page 9

GNuME Components: CG/DG Numerics

[Figure: domain decomposition into elements $\Omega_e$ with element boundary $\Gamma_e$ and outward normal $\mathbf{n}$; reference element on $[-1,1] \times [-1,1]$ with Legendre-Gauss-Lobatto points]

Domain decomposition: $\Omega = \bigcup_{e=1}^{N_e} \Omega_e$

Approximate the local solution as $q_N^{(e)}(\mathbf{x},t) = \sum_{j=1}^{M} \psi_j(\mathbf{x})\, q_j^{(e)}(t)$

Basis functions $\psi_j(\mathbf{x})$: Lagrange polynomials on Legendre-Gauss-Lobatto points.
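To illustrate the expansion above, the following C sketch evaluates the Lagrange basis functions and the local element solution; the function names and the node array xi are hypothetical, not GNuME routines.

/* Sketch: j-th Lagrange basis polynomial psi_j evaluated at x, given the
 * M interpolation nodes xi[0..M-1] (e.g., Legendre-Gauss-Lobatto points). */
double lagrange_basis(int j, double x, const double *xi, int M)
{
    double psi = 1.0;
    for (int k = 0; k < M; ++k)
        if (k != j)
            psi *= (x - xi[k]) / (xi[j] - xi[k]);
    return psi;
}

/* Local solution q_N^(e)(x) = sum_j psi_j(x) q_j within one element. */
double local_solution(double x, const double *q, const double *xi, int M)
{
    double qN = 0.0;
    for (int j = 0; j < M; ++j)
        qN += lagrange_basis(j, x, xi, M) * q[j];
    return qN;
}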

Bottom line: choose methodologies that extend the shelf-life of the model (e.g., offer new capabilities).

Page 10

GNuME Numerics: Future Capabilities

NUMA with dynamically adaptive mesh refinement: the uniform-mesh simulation (left) is 2-4x slower than the AMR simulation (right) on 200 processors with 1 million DOF.

[Figure panels: WRF, HIGRAD, NUMA-LF, and NUMA-ARK2 simulations]

Page 11

GNuME Numerics: NUMO Lock Exchange

Nondimensional parameters:

$\mathrm{Fr} = \dfrac{u}{\sqrt{\frac{\Delta\rho}{\rho_0} g H}} = \dfrac{\text{inertia}}{\text{gravity}}, \qquad \mathrm{Gr} = \dfrac{g\, \Delta\rho\, H^3}{\nu^2} = \dfrac{\text{buoyancy}}{\text{viscosity}}, \qquad \mathrm{Sc} = \dfrac{\nu}{\kappa_T} = \dfrac{\text{viscosity}}{\text{diffusivity}}$

Case                      Dim  Gr         Sc    Fr (no-slip)  Fr (free-slip)
NUMO                      2D   1.25x10^6  6.74  0.420         0.482
Hiester et al. (2011)     2D   1.25x10^6  -     0.417         0.482
Fringer et al. (2006)     2D   1.25x10^6  -     0.396         0.428
Simpson & Britter (1979)  EXP  4.8x10^6   ~7    0.432         -
NUMO                      2D   1.25x10^6  0.71  0.407         0.475
Hartel et al. (2000)      2D   1.25x10^6  0.71  0.406         0.477
Cantero et al. (2007)     3D   1.5x10^6   0.71  0.407         -

[Figure: front position (distance from initial position, m) versus time for the no-slip and free-slip cases]

Page 12

GNuME: Global Shallow Water

Geometric flexibility (unstructured/non-orthogonal grids, adaptivity, consistent nesting) allows for focused regions, refined not just in grid spacing but also in order of accuracy.

Marras et al., QJRMS (2015)

Page 13

GNuME: HPC Landscape

Current ESMs run adequately on CPU-only computers (x86).

For example, NUMA on Mira at 3 km global resolution (dynamics only).

Mira (ALCF) is an IBM BG/Q with 786,432 cores, each supporting 4 hardware threads, and a peak performance of 10 petaflops.

Mueller et al., IJHPCA (2017)

Page 14

GNuME: HPC Landscape

Current HPC landscape: it has changed to hybrid [top500.org, June 2017]

Rank  Name               Theoretical Peak (PFlops)  # of Cores  Hardware                      Country
1     Sunway TaihuLight  125                        10,649,600  260-core manycore             China
2     Tianhe-2           54                         3,120,000   Intel Xeon Phi (KNC)          China
3     Piz Daint          25                         361,760     Intel Xeon Phi + Nvidia P100  Switzerland
4     Titan              27                         560,640     AMD Opteron + Nvidia K20      U.S.
5     Sequoia            27                         1,572,864   IBM BG/Q                      U.S.

Page 15

GNuME: HPC Landscape

Near-future HPC landscape: it has changed to hybrid

Rank  Name               Theoretical Peak (PFlops)  # of Cores  Hardware                         Country
1     Aurora             1000+                      ?           Not KNH                          U.S.
2     Summit             207                        230,000     IBM Power 9 + Nvidia Volta GPUs  U.S.
3     Sunway TaihuLight  125                        10,649,600  260-core manycore                China
4     Tianhe-2           54                         3,120,000   Intel Xeon Phi (KNC)             China
5     Piz Daint          25                         361,760     Intel Xeon Phi + Nvidia P100     Switzerland
6     Titan              27                         560,640     AMD Opteron + Nvidia K20         U.S.
7     Sequoia            27                         1,572,864   IBM BG/Q                         U.S.

Page 16

GNuME: HPC Landscape

Many APIs exist for accessing manycore hardware: OpenACC, OpenMP, CUDA, OpenCL, Kokkos, OCCA.

As a domain scientist, it is perhaps easier to use a DSL (e.g., GridTools, etc.).

Our strategy has been to control as much as possible and deal directly with arrays (array of structures).

Page 17

GNuME: HPC Landscape

How to harness different hardware

In the hardware-agnostic approach, the idea is to write the compute kernels once and offload them to different back-ends for the various architectures (the figure illustrates the idea). This solution offers portability but no guarantee of performance.

The performance comes from tuning at the kernel-language level (see the sketch below the figure).

[Figure, courtesy of Tim Warburton: one kernel-language source translated to OpenMP and CUDA back-ends]
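Below is a minimal sketch in OCCA's C-like kernel language (OKL); the kernel name, loop structure, and tile width are illustrative assumptions rather than an actual NUMA kernel, and real tuning would adjust the loop blocking for each back-end.

/* Hypothetical OKL kernel: the @outer/@inner annotations tell OCCA how to
 * map the loop onto CUDA blocks/threads or OpenMP threads when lowered. */
@kernel void scaleField(const int n,
                        const double alpha,
                        const double *q,
                        double *qout) {
  for (int i = 0; i < n; ++i; @tile(64, @outer, @inner)) {
    qout[i] = alpha * q[i];
  }
}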

Page 18

GNuME: HPC Landscape

How to access different hardware

Example of the NUMA hardware-agnostic approach (with OCCA[1]) on GPUs and Intel Xeon Phis: the same code base achieves 90% (weak) scaling efficiency.

On the GPU we use a node-per-thread approach, whereas on many-core we use an element-per-thread approach (via myParallelFor); a conceptual sketch appears at the end of this slide.

[Figures: NUMA results from Abdi et al., IJHPCA (2017) on Titan (OLCF) GPUs and on Intel KNL]

[Figure: roofline plot for an Nvidia K20X GPU (334 GB/s memory bandwidth, 1707 GFLOPS/s peak), showing achieved GFLOPS/s versus arithmetic intensity (GFLOPS/GB) for the volume, ARK update, gradient, and diffusion kernels and for extract_q_gmres_schur, create_lhs_gmres_schur_set2c, and create_rhs_gmres_schur]

[1] Medina et al., arXiv:1403.0968 (2014)
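Below is a conceptual C sketch, not GNuME source, contrasting the two mappings; nelem, npts, q, rhs, and compute_rhs_at_node are hypothetical placeholders, and on a real GPU the flattened node loop would be replaced by the CUDA thread index.

/* Placeholder per-node right-hand-side contribution (purely illustrative). */
static void compute_rhs_at_node(int e, int i, int npts,
                                const double *q, double *rhs)
{
    rhs[(long)e * npts + i] = 2.0 * q[(long)e * npts + i];
}

/* Many-core (element-per-thread): each thread owns an entire element. */
void rhs_element_per_thread(int nelem, int npts, const double *q, double *rhs)
{
    #pragma omp parallel for           /* one element per thread */
    for (int e = 0; e < nelem; ++e)
        for (int i = 0; i < npts; ++i)
            compute_rhs_at_node(e, i, npts, q, rhs);
}

/* GPU (node-per-thread): every (element, node) pair gets its own thread. */
void rhs_node_per_thread(int nelem, int npts, const double *q, double *rhs)
{
    for (int tid = 0; tid < nelem * npts; ++tid) {
        int e = tid / npts;
        int i = tid % npts;
        compute_rhs_at_node(e, i, npts, q, rhs);
    }
}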

Page 19

Talk Summary

Caveats

Lessons Learned via Development of GNuME

Summary of Lessons Learned

Where to next


Page 20

Summary

Our driving principle has been simplicity (readability) over performance.

We have also chosen approaches that extend the shelf-life of the model (e.g., numerics, hardware-agnosticism, etc.).

A balance must be struck between rewriting code (a good thing) and reusing code as much as possible.

The choice of programming language is important, but a balance must be struck between maintainability (developer) and readability (user). Ideally we would like one language for both prototyping and deployment (e.g., Firedrake, Julia).

For open source, we need a simple enough language with a large user base in the ESM community.

DSLs can insulate domain scientists from much of the software-engineering complexity and speed up progress (my group has not tried this approach yet).

Page 21

Talk Summary

Caveats

Lessons Learned via Development of GNuME

Summary of Lessons Learned

Where to next


Page 22

Where to Next

In collaboration with Caltech (Andrew Stuart's talk), MIT (Raffaele Ferrari's talk), and JPL, we will develop an infrastructure for running both a global atmosphere and a multitude of high-resolution LES domains (with data) for teaching the global model.

To do so, we will redesign the code in a hardware-agnostic approach (e.g., OCCA), with compute kernels written in a variety of languages (reusing existing functions), managed by a Python/Julia workflow.

Open source from the beginning (e.g., via GitHub).

The numerical methods will stay the same (e.g., element-based Galerkin methods, time-integrators, non-conforming grids).

Create a software infrastructure that allows for specific components (e.g., ESM vs. CFD) with specific builds, to make it easier for collaborators to add components, streamline the workflow, and increase (work) performance.