
Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling System (SMS)

Dan Schaffer

NOAA Forecast Systems Laboratory (FSL)

August 2001


Outline

• Who we are

• Intro to SMS

• Application of SMS to ROMS

• Ongoing Work

• Conclusion


Who we are

• Mark Govett

• Leslie Hart

• Tom Henderson

• Jacques Middlecoff

• Dan Schaffer

• Developing SMS for 20+ person-years


Intro to SMS

• Overview
  – Directive based
    • FORTRAN comments
    • Enables single source parallelization (see the sketch below)
  – Distributed or shared memory machines
  – Performance portability
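
To make the single-source point concrete, here is a minimal sketch (the array names and loop bounds are illustrative, not taken from any model): an SMS directive is an ordinary FORTRAN comment, so a serial compiler simply ignores it, while the SMS pre-processor translates it into the parallel operation it names.

!     Minimal single-source sketch; arrays and bounds are illustrative only.
!     A serial compiler treats the CSMS$ line as a comment; SMS translates
!     it into the halo update needed before x(i-1) and x(i+1) are read.
CSMS$EXCHANGE(x)
      do i = 2, im-1
        y(i) = x(i-1) + x(i+1)
      end do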


Distributed Memory Parallelism


Code Parallelization using SMS

[Workflow diagram: Original Serial Code → add SMS directives → SMS Serial Code; the SMS Serial Code compiles unchanged into a Serial Executable, or is run through PPP (the Parallel Pre-Processor) to produce SMS Parallel Code, which compiles into a Parallel Executable.]


Low-Level SMS

[Layer diagram: SMS Parallel Code runs on top of the low-level SMS libraries (NNT, the FDA library, the spectral library, and parallel I/O), which in turn are built on MPI, SHMEM, etc.]


Intro to SMS (contd)

– Support for all of F77 plus much of F90, including:
  • Dynamic memory allocation
  • Modules (partially supported)
  • User-defined types

– Supported Machines:
  • COMPAQ Alpha-Linux Cluster (FSL “Jet”)
  • PC-Linux Cluster
  • SUN Sparcstation
  • SGI Origin 2000
  • IBM SP-2


Intro to SMS (contd)

• Models Parallelized
  – Ocean: ROMS, HYCOM, POM
  – Mesoscale Weather: FSL RUC, FSL QNH, NWS Eta, Taiwan TFS (Nested)
  – Global Weather: Taiwan GFS (Spectral)
  – Atmospheric Chemistry: NOAA Aeronomy Lab


Key SMS Directives

• Data Decomposition
  – csms$declare_decomp
  – csms$create_decomp
  – csms$distribute

• Communication
  – csms$exchange
  – csms$reduce

• Index Translation
  – csms$parallel

• Incremental Parallelization
  – csms$serial

• Performance Tuning
  – csms$flush_output

• Debugging Support
  – csms$reduce (bitwise exact)
  – csms$compare_var
  – csms$check_halo


SMS Serial Code

      program DYNAMIC_MEMORY_EXAMPLE
      parameter(IM = 15)
CSMS$DECLARE_DECOMP(my_dh)
CSMS$DISTRIBUTE(my_dh, 1) BEGIN
      real, allocatable :: x(:)
      real, allocatable :: y(:)
      real xsum
CSMS$DISTRIBUTE END
!     Create a 1-D decomposition of global size IM with a halo width of 2
CSMS$CREATE_DECOMP(my_dh, <IM>, <2>)
      allocate(x(im))
      allocate(y(im))
      open (10, file = 'x_in.dat', form='unformatted')
      read (10) x
!     Loops over i inside this region run over each processor's local range
CSMS$PARALLEL(my_dh, <i>) BEGIN
      do 100 i = 3, 13
        y(i) = x(i) - x(i-1) - x(i+1) - x(i-2) - x(i+2)
  100 continue
!     Update halo points of y before neighboring values are read below
CSMS$EXCHANGE(y)
      do 200 i = 3, 13
        x(i) = y(i) + y(i-1) + y(i+1) + y(i-2) + y(i+2)
  200 continue
      xsum = 0.0
      do 300 i = 1, 15
        xsum = xsum + x(i)
  300 continue
!     Combine the partial sums into a global sum across processors
CSMS$REDUCE(xsum, SUM)
CSMS$PARALLEL END
      print *,'xsum = ',xsum
      end


Advanced Features

• Nesting
• Incremental Parallelization
• Debugging Support (Run-time configurable)
  – CSMS$REDUCE
    • Enables bit-wise exact reductions
  – CSMS$CHECK_HALO
    • Verifies a halo region is up-to-date
  – CSMS$COMPARE_VAR
    • Compare variables for simultaneous runs with different numbers of processors (see the sketch below)
• HYCOM 1-D decomp parallelized in 9 days
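
A hedged sketch of the debugging support in use; the directive name comes from the list above, but its argument form here is an assumption rather than something shown in this talk.

!     Hedged sketch only: the argument list of CSMS$COMPARE_VAR is assumed.
!     With debugging enabled at run time, two simultaneous runs using
!     different numbers of processors would compare the decomposed array y
!     at this point and report any differences.
CSMS$COMPARE_VAR(y)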


Incremental Parallelization

SMS Directive: CSMS$SERIAL

[Diagram: decomposed "local" arrays are gathered into "global" arrays, the serial CALL NOT_PARALLEL(...) executes, and results are scattered from "global" back to "local" arrays. A sketch follows.]
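
A minimal sketch of the directive in use, assuming CSMS$SERIAL takes the same BEGIN/END block form as the other SMS block directives shown earlier; the routine's argument list is a hypothetical placeholder.

!     Hedged sketch: the BEGIN/END block form is assumed by analogy with
!     CSMS$DISTRIBUTE and CSMS$PARALLEL.  Inside the block, decomposed
!     ("local") arrays are gathered into "global" copies, the still-serial
!     routine runs, and its results are scattered back to the local arrays.
CSMS$SERIAL BEGIN
      call NOT_PARALLEL(x, y)   ! hypothetical argument list
CSMS$SERIAL END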


Advanced Features (contd)

• Overlapping Output with Computations (FORTRAN-style I/O only)

• Run-time Process Configuration
  – Specify
    • number of processors per decomposed dimension, or
    • number of grid points per processor
  – 15% performance boost for HYCOM
  – Support for irregular grids coming soon


SMS Performance (Eta)

• Eta model run in production at NCEP for use in National Weather Service Forecasts

• 16000 Lines of Code (excluding comments)

• 198 SMS Directives added to the code


Eta Performance

• Performance measured on NCEP SP2
• I/O excluded
• Resolution: 223x365x45
• 88-PE run-time beats NCEP hand-coded MPI by 1%
• 88-PE exchange time beats hand-coded MPI by 17%

Processors Time (sec.) Efficiency

4 406 1.00

16 103 0.99

64 29.3 0.86

88 23.9 0.80


SMS Performance (HYCOM)

• 4500 Lines of Code (excluding comments)

• 108 OpenMP directives included in the code

• 143 SMS Directives added to the code


HYCOM Performance

• Performance measured on O2K
• Resolution: 135x256x14
• Serial code runs in 136 seconds

Procs   OpenMP Time (sec.)   OpenMP Efficiency   SMS Time (sec.)   SMS Efficiency

1       142                  0.96                127               1.07

8       22.6                 0.75                14.5              1.17

16      12.9                 0.66                7.60              1.18


Intro to SMS (contd)

– Extensive documentation available on the web

– New development aided by
  • Regression test suite

• Web-based bug tracking system


Outline

• Who we are

• Intro to SMS

• Application of SMS to ROMS

• Ongoing Work

• Conclusion


SMS ROMS Implementation

• Used awk and cpp to convert ROMS to dynamic memory, simplifying SMS parallelization

• Leveraged existing shared-memory parallelism (loops already written as do I = ISTR, IEND; see the sketch below)

• Directives added to handle the NEP scenario

• 13,000 lines of code, 132 SMS directives

• Handled netCDF I/O with CSMS$SERIAL
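
A hedged sketch of how the existing shared-memory loop structure might carry over; the decomposition name, the two-index form of CSMS$PARALLEL, and the arrays being updated are illustrative assumptions patterned on the 1-D example earlier in the talk.

!     Hedged sketch only: roms_dh, the <I>,<J> index pair, and the arrays
!     are illustrative.  The loop body is unchanged from the shared-memory
!     code; within the CSMS$PARALLEL region the existing ISTR/IEND and
!     JSTR/JEND bounds correspond to each processor's local index range.
CSMS$PARALLEL(roms_dh, <I>, <J>) BEGIN
      do J = JSTR, JEND
        do I = ISTR, IEND
          zeta(I,J) = zeta(I,J) + dt*rhs_zeta(I,J)
        end do
      end do
CSMS$PARALLEL END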


Results and Performance

• Runs and produces correct answer on all supported SMS machines

• Low Resolution (128x128x30)
  – Scaling on “Jet”, O2K, and T3E
  – Run-times for the main loop (21 time steps), excluding I/O

• High Resolution (210x550x30)
  – PMEL is using it in production
  – 97% efficiency between 8 and 16 processors on “Jet”


SMS Low Res ROMS “Jet” Performance

Processors Time (sec.) Efficiency

1 (serial code) 153 1.00

4 41.3 0.93

8 21.6 0.89

16 12.6 0.76
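
The efficiency column appears to be speedup relative to the serial run divided by the processor count (an inference from the numbers; the talk does not state the formula). For the 4-processor "Jet" run, for example:

\[ E_p = \frac{T_{\text{serial}}}{p \, T_p}, \qquad E_4 = \frac{153}{4 \times 41.3} \approx 0.93 \]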


SMS Low Res ROMS O2K Performance

Processors Time (sec.) Efficiency

1 (serial code) 298 1.00

8 41.6 0.90

16 22.4 0.83


SMS Low Res ROMS T3E Performance

Processors Time (sec.) Efficiency (relative to 8 processors)

8 63.2 1.00

16 35.8 0.88

32 19.5 0.81


Outline

• Who we are

• Intro to SMS

• Application of SMS to ROMS

• Ongoing Work

• Conclusion


Ongoing Work (funding dependent)

• Full F90 Support

• Support for parallel netCDF

• T3E port

• SHMEM implementation on T3E, O2K

• Parallelize other ROMS scenarios

• Implement SMS nested ROMS

• Implement SMS coupled ROMS/COAMPS


Conclusion

• SMS is a high level directive-based tool

• Simple single source parallelization

• Performance optimizations provided

• Strong debugging support included

• Performance beats hand-coded MPI (NCEP Eta)

• SMS is performance portable


Web-Site

www-ad.fsl.noaa.gov/ac/sms.html