
Hybrid Strategies for the NEMO Ocean Model on Many-core Processors

I. Epicoco, S. Mocavero, G. Aloisio
University of Salento & CMCC, Italy

2012 Programming weather, climate, and earth-system models on heterogeneous multi-core platforms

Boulder (CO), Sept. 12-13, 2012

Introduction

NEMO uses:
•  Time discretization based on a three-level scheme where the tendency term is evaluated centered in time, forward or backward depending on the nature of the term
•  Space discretization using the Arakawa C grid
•  Vertical discretization based on full-step or partial-step z- or s-coordinates
•  For tracers and momentum, explicit, split-explicit and filtered free-surface formulations are implemented

NEMO is characterized by:
•  Intensive memory usage
•  2D MPI parallelization
•  Cross communication pattern, using point-to-point calls (a halo-exchange sketch follows below)
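As an illustration of the cross communication pattern, the following is a minimal sketch of a halo exchange done with blocking point-to-point calls. The routine name, the 1-point halo, the neighbour ranks and the buffer handling are assumptions for illustration only and do not reproduce the actual NEMO lbc_lnk interface; only one of the four exchanges is written out.

SUBROUTINE halo_exchange( pt, jpi, jpj, nbwest, nbeast, nbsouth, nbnorth, comm )
   USE mpi
   IMPLICIT NONE
   INTEGER, INTENT(in)          :: jpi, jpj           ! local sizes, incl. a 1-point halo
   INTEGER, INTENT(in)          :: nbwest, nbeast, nbsouth, nbnorth, comm
   DOUBLE PRECISION, INTENT(inout) :: pt(jpi,jpj)
   DOUBLE PRECISION :: zsbuf(jpj), zrbuf(jpj)
   INTEGER :: ierr, istat(MPI_STATUS_SIZE)

   ! East-West exchange: send the first inner column to the western neighbour,
   ! receive the eastern neighbour's column into the eastern halo
   zsbuf(:) = pt(2,:)
   CALL MPI_SENDRECV( zsbuf, jpj, MPI_DOUBLE_PRECISION, nbwest, 1, &
                      zrbuf, jpj, MPI_DOUBLE_PRECISION, nbeast, 1, &
                      comm, istat, ierr )
   pt(jpi,:) = zrbuf(:)
   ! ... the symmetric east-to-west exchange, and the same pattern along j
   !     for the North-South exchange, follow in the same way
END SUBROUTINE halo_exchange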

Goal

Evaluating different hybrid parallel approaches for multi-core architectures:
•  Reducing the communication time
•  Exploiting the shared memory
•  Exploiting the memory hierarchy

Test case: Advection scheme for tracers

The implementation follows these steps:
-  Horizontal advective fluxes
   -  First guess of the slopes
   -  Boundary exchange among processes
   -  Evaluation of the slopes
   -  Evaluation of the MUSCL fluxes
   -  Boundary exchange among processes
-  Vertical advective fluxes
   -  Evaluation of the slopes
   -  Evaluation of the MUSCL fluxes

The implementation uses 3 nested loops along the i, j and k directions.
The MPI communication uses a cross pattern (with point-to-point calls).

In the MUSCL formulation (traadv_muscl.f90), the tracer is evaluated at velocity points assuming a linear tracer variation between two adjacent T-points.
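Below is a minimal sketch, not a literal extract of traadv_muscl.f90, of the i-direction step: first-guess slopes, then the upstream-biased MUSCL fluxes at U-points assuming a linear tracer variation between adjacent T-points. The routine name, the array names (ptb, pun, e1u, zwx), the loop bounds and the omitted slope limiter and halo exchange are illustrative assumptions.

SUBROUTINE tra_adv_muscl_i( ptb, pun, e1u, p2dt, zwx, jpi, jpj, jpk )
   IMPLICIT NONE
   INTEGER, INTENT(in)  :: jpi, jpj, jpk
   REAL(8), INTENT(in)  :: ptb(jpi,jpj,jpk)   ! tracer at T-points (before field)
   REAL(8), INTENT(in)  :: pun(jpi,jpj,jpk)   ! zonal velocity at U-points
   REAL(8), INTENT(in)  :: e1u(jpi,jpj)       ! zonal grid spacing at U-points
   REAL(8), INTENT(in)  :: p2dt               ! tracer time step
   REAL(8), INTENT(out) :: zwx(jpi,jpj,jpk)   ! advective flux through the U-face
   REAL(8) :: zslp(jpi,jpj,jpk), zcu, zt
   INTEGER :: ji, jj, jk

   ! 1) first guess of the slopes
   zslp(:,:,:) = 0.d0
   DO jk = 1, jpk
      DO jj = 1, jpj
         DO ji = 1, jpi-1
            zslp(ji,jj,jk) = ptb(ji+1,jj,jk) - ptb(ji,jj,jk)
         END DO
      END DO
   END DO
   ! 2) boundary exchange on zslp and slope limiting would go here

   ! 3) MUSCL fluxes: linear tracer variation between two adjacent T-points,
   !    evaluated on the upstream side of the U-point
   zwx(:,:,:) = 0.d0
   DO jk = 1, jpk
      DO jj = 1, jpj
         DO ji = 2, jpi-1
            zcu = pun(ji,jj,jk) * p2dt / e1u(ji,jj)     ! local Courant number
            IF( pun(ji,jj,jk) >= 0.d0 ) THEN
               zt = ptb(ji  ,jj,jk) + 0.5d0*( 1.d0 - zcu )*zslp(ji  ,jj,jk)
            ELSE
               zt = ptb(ji+1,jj,jk) - 0.5d0*( 1.d0 + zcu )*zslp(ji+1,jj,jk)
            END IF
            zwx(ji,jj,jk) = pun(ji,jj,jk) * zt
         END DO
      END DO
   END DO
END SUBROUTINE tra_adv_muscl_i

The three nested loops of this kernel are the target of the parallelization strategies compared in the next slides.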

Outer loop parallelization

Only the loops along k are distributed among the parallel threads (a sketch follows below):

PROs:
-  It is extremely simple to implement
-  It introduces a minimal amount of parallel overhead

CONs:
-  The parallelism is limited to the number of vertical levels (O(10²))
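A minimal sketch of this strategy: only the k loop is work-shared among the OpenMP threads. The kernel body (a single multiply) is a placeholder standing in for the real advection computation, and the routine and array names are illustrative.

SUBROUTINE adv_kernel_kpar( ptb, pun, zwx, jpi, jpj, jpk )
   IMPLICIT NONE
   INTEGER, INTENT(in)  :: jpi, jpj, jpk
   REAL(8), INTENT(in)  :: ptb(jpi,jpj,jpk), pun(jpi,jpj,jpk)
   REAL(8), INTENT(out) :: zwx(jpi,jpj,jpk)
   INTEGER :: ji, jj, jk

   !$OMP PARALLEL DO PRIVATE(ji,jj,jk) SCHEDULE(static)
   DO jk = 1, jpk                      ! only this loop is split among threads
      DO jj = 1, jpj
         DO ji = 1, jpi
            zwx(ji,jj,jk) = pun(ji,jj,jk) * ptb(ji,jj,jk)   ! placeholder kernel
         END DO
      END DO
   END DO
   !$OMP END PARALLEL DO
END SUBROUTINE adv_kernel_kpar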

Tile-based parallelization

Each MPI sub-domain is divided into 3D tiles; the available tiles are distributed among the parallel threads (a sketch follows below).

PROs:
-  It allows fine-grained parallelism (down to a single grid point)

CONs:
-  Cache misses increase, since the approach does not preserve contiguous memory access
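A minimal sketch of the tile-based strategy, assuming fixed tile sizes chosen here arbitrarily (nti, ntj, ntk); the kernel body is again a placeholder for the real computation.

SUBROUTINE adv_kernel_tiles( ptb, pun, zwx, jpi, jpj, jpk )
   IMPLICIT NONE
   INTEGER, INTENT(in)  :: jpi, jpj, jpk
   REAL(8), INTENT(in)  :: ptb(jpi,jpj,jpk), pun(jpi,jpj,jpk)
   REAL(8), INTENT(out) :: zwx(jpi,jpj,jpk)
   INTEGER, PARAMETER   :: nti = 16, ntj = 16, ntk = 4     ! tile sizes (tunable)
   INTEGER :: i0, j0, k0, ji, jj, jk

   !$OMP PARALLEL DO COLLAPSE(3) PRIVATE(i0,j0,k0,ji,jj,jk) SCHEDULE(dynamic)
   DO k0 = 1, jpk, ntk                ! loops over the 3D tiles
      DO j0 = 1, jpj, ntj
         DO i0 = 1, jpi, nti
            DO jk = k0, MIN(k0+ntk-1, jpk)      ! loops inside one tile
               DO jj = j0, MIN(j0+ntj-1, jpj)
                  DO ji = i0, MIN(i0+nti-1, jpi)
                     zwx(ji,jj,jk) = pun(ji,jj,jk) * ptb(ji,jj,jk)   ! placeholder kernel
                  END DO
               END DO
            END DO
         END DO
      END DO
   END DO
   !$OMP END PARALLEL DO
END SUBROUTINE adv_kernel_tiles

A dynamic schedule over the collapsed tile loops lets threads pick up tiles as they become free, at the cost of the non-contiguous memory accesses mentioned above.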

Loops merge parallelization

The 3 nested loops are merged together and all of the iterations are distributed among the threads (a sketch follows below).

PROs:
-  It allows fine-grained parallelism (down to a single grid point)

CONs:
-  It requires a deep revision of the array indexes
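A minimal sketch of the loop-merge strategy: the three loops become a single loop over a linear index, which is what forces the revision of the array indexing noted above. The placeholder kernel and the names are assumptions.

SUBROUTINE adv_kernel_merged( ptb, pun, zwx, jpi, jpj, jpk )
   IMPLICIT NONE
   INTEGER, INTENT(in)  :: jpi, jpj, jpk
   REAL(8), INTENT(in)  :: ptb(jpi,jpj,jpk), pun(jpi,jpj,jpk)
   REAL(8), INTENT(out) :: zwx(jpi,jpj,jpk)
   INTEGER :: ji, jj, jk, jn

   !$OMP PARALLEL DO PRIVATE(ji,jj,jk,jn) SCHEDULE(static)
   DO jn = 0, jpi*jpj*jpk - 1           ! one iteration per grid point
      jk = jn / (jpi*jpj) + 1           ! recover the original 3D indexes
      jj = MOD( jn, jpi*jpj ) / jpi + 1
      ji = MOD( jn, jpi ) + 1
      zwx(ji,jj,jk) = pun(ji,jj,jk) * ptb(ji,jj,jk)   ! placeholder kernel
   END DO
   !$OMP END PARALLEL DO
END SUBROUTINE adv_kernel_merged

With OpenMP 3.0 or later, a COLLAPSE(3) clause on the original loop nest gives a similar distribution of the iterations without rewriting the indexes.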

Architectures

•  IBM P6 (Calypso at CMCC)
   •  P575 nodes with 16 dual-core chips
   •  4 QUADs with 4 processors each
   •  Memory: 128GB per node
   •  Interconnection: Infiniband 4x DDR

•  IBM iDataPlex (PLX at Cineca)
   •  iDataPlex M3 nodes with 2 6-core chips
   •  Memory: 48GB per node
   •  Interconnection: Infiniband with 4x QDR switches

•  IBM PPC970 (MareNostrum at BSC)
   •  P615 nodes with 2 dual-core processors
   •  Memory: 8GB per node
   •  Interconnection: Myrinet network

•  CRAY XE6 (HECToR at STFC)
   •  XE6 nodes with 2 16-core chips
   •  4 NUMA regions with 8 cores each
   •  Memory: 32GB per node
   •  Interconnection: Cray Gemini communication chips

Results on IBM P6 – Calypso at CMCC

Results on IBM iDataPlex – PLX at CINECA

Results on CRAY XE6 – HECToR at STFC

Results on IBM PPC970 – MareNostrum at BSC

Conclusion

•  The OpenMP parallelization provides a further level of parallelism (along the vertical levels) in a relatively easy way

•  The benefit of the OpenMP parallelization can be appreciated only at high levels of parallelism

•  The benefit given by the hybrid parallelization strongly depends on the architecture and on the MPI implementation

•  The bottleneck for the analyzed advection scheme is the weight of the communication time relative to the computing time: the communication time decreases linearly with the dimension of the sub-domain, while the computing time decreases quadratically (a worked example follows below)
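A brief worked version of the scaling argument, assuming a square n-by-n horizontal sub-domain with a fixed-width halo (an illustrative model, not a fit to the measured data):

T_{\mathrm{comp}} \propto n^{2}, \qquad T_{\mathrm{comm}} \propto n
\quad\Longrightarrow\quad
\frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \propto \frac{1}{n}

So halving the sub-domain edge (i.e. quadrupling the number of MPI processes) roughly doubles the relative weight of communication, which is why the hybrid approach pays off mainly at high levels of parallelism.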

Next Steps

•  To optimize the memory access

Credits

•  Italo Epicoco - CMCC
•  Silvia Mocavero - CMCC
•  Andrew Porter - STFC
•  Stephen Pickles - STFC
•  Mike Ashworth - STFC
•  Giovanni Aloisio - CMCC

Thanks!
