MM5 Optimization Experiences and
Numerical Sensitivities Found in Convective/Non-Convective
Cloud Interactions
• Carlie J. Coats, Jr., MCNC ([email protected])
• John N. McHenry, MCNC ([email protected])
• Elizabeth Hayes, SGI ([email protected])
Introduction
• MM5 Optimization for Microprocessor/Parallel Systems
• Started from MM5V2.[7,12]-GSPBL
• Speedups so far: 1.4x on SGI, 1.9x on Linux/x86, 2.36x on IBM SP
• Tiny numerical changes cause gross changes in the output
– (but these seem to be unbiased)
• Causative mechanisms include convective triggering
– inherent problem; this is ill-conditioned in nature
• Need to be careful with algorithmic formulations and optimizations
– will not be fixed simply by improved compiler technology
Optimization for Microprocessor/Parallel Systems
• Processor characteristics:
– Pipelining and Superscalarity—need lots of independent work
– Hierarchical memory organization with registers and caches
• Solutions:
– Data structure transformations
– Logic and loop re-factoring
– Expensive-operation avoidance
– Minimize and optimize memory traffic
Pipelining and Superscalarity
• Modern microprocessors try to have multiple instructions in different stages of execution on each FPU or ALU at the same time.
• Dependencies between instructions (where one needs to complete before another can start) stall the system.
• Current technology: 20-30 instructions "in flight" at one time; even more (50+?) instructions in the future.
• Standard solutions: need lots of “independent work” to fill pipelines
– Loop unrolling for vectorizable loops (some compilers can do this)
– Loop jamming, so that there are long loop bodies with lots of independent work (some compilers can do some of this)
– Logic refactoring, so that IFs are outside the loops, not inside (compilers can NOT do this)
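The unrolling/jamming idea can be sketched in C (a hypothetical example, not MM5 code; the function and variable names are invented):

```c
#include <stddef.h>

/* Two short loops over the same arrays: two passes over memory,
   and each loop body is too small to keep the pipelines full. */
void scale_then_shift(double *out, const double *in, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = 2.0 * in[i];      /* pass 1 */
    for (size_t i = 0; i < n; i++) out[i] = out[i] + 1.0;     /* pass 2 */
}

/* Jammed version: one pass over memory, one longer loop body whose
   iterations are independent, so more instructions can be in flight. */
void scale_shift_jammed(double *out, const double *in, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = 2.0 * in[i] + 1.0;
}
```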
Caches and Memory Traffic
• Memory traffic is a prime predictor of performance
– McCalpin's "STREAM" benchmarks
• Want stride 1 data access, especially for “store” sequences
• Want small data structures that “live in cache” or (where possible) even scalars that “live in registers.”
• Parallel cache-line conflicts "can cost 100X performance"--SGI
• Standard solutions:
– Loop unrolling and loop jamming lead to value re-use (some compilers can do some of this)
– Loop refactoring and data structure reorganization (some compilers can do loop refactoring but none do major data structure reorganization)
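The stride-1 point can be sketched in C (an illustrative toy, not MM5 code; note that Fortran, MM5's language, is column-major, so there the *leftmost* index should vary fastest):

```c
#include <stddef.h>

enum { NR = 4, NC = 8 };

/* C arrays are row-major: a[i][j] and a[i][j+1] are adjacent in memory,
   so making j the inner loop gives stride-1 access. */
double sum_stride1(const double a[NR][NC])
{
    double s = 0.0;
    for (size_t i = 0; i < NR; i++)
        for (size_t j = 0; j < NC; j++)
            s += a[i][j];            /* stride 1: walks each cache line */
    return s;
}

/* Swapping the loops computes the same sum, but every access jumps
   NC elements, touching a new cache line far more often. */
double sum_strideN(const double a[NR][NC])
{
    double s = 0.0;
    for (size_t j = 0; j < NC; j++)
        for (size_t i = 0; i < NR; i++)
            s += a[i][j];            /* stride NC: poor cache behavior */
    return s;
}
```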
Expensive Operations
• Use of X**0.5 instead of SQRT(X) (this is also less accurate)
• Use of divides and reciprocals
– we even see examples of X=A/B/C/D in the code, instead of X=A/(B*C*D)
– use RPS* variables
– rationalize fractions
• EXP(A)*EXP(B) vs. EXP(A+B) (happens in LWRAD)
• repeated calculations of the same trig or log functions (happens in SOUND)
Logic Re-Factoring
• Simplified example adapted from MRFPBL:

      DO K=1,KL
        DO I=1,ILX
          QX(I,K) =QVB(I,J,K)*RPSB(I,J)
          QCX(I,K)=0.
          QIX(I,K)=0.
        END DO
      END DO
      IF ( IMOIST(IN).NE.1 ) THEN
        DO K=1,KL
          DO I=1,ILX
            QCX(I,K)=QCB(I,J,K)*RPSB(I,J)
            IF ( IICE.EQ.1 ) QIX(I,K)=QIB(I,J,K)*RPSB(I,J)
          END DO
        END DO
      END IF
• Refactored version, with the IFs outside the loops:

      IF ( IMOIST(IN).EQ.1 ) THEN
        DO K=1,KL
          DO I=1,ILX
            QX(I,K) =QVB(I,J,K)*RPSB(I,J)
            QCX(I,K)=0.
            QIX(I,K)=0.
          END DO
        END DO
      ELSE IF ( IICE.NE.1 ) THEN      ! where IMOIST.NE.1:
        DO K=1,KL
          DO I=1,ILX
            QX(I,K) =QVB(I,J,K)*RPSB(I,J)
            QCX(I,K)=QCB(I,J,K)*RPSB(I,J)
            QIX(I,K)=0.
          END DO
        END DO
      ELSE                            ! IMOIST.NE.1 and IICE.EQ.1
        DO K=1,KL
          DO I=1,ILX
            QX(I,K) =QVB(I,J,K)*RPSB(I,J)
            QCX(I,K)=QCB(I,J,K)*RPSB(I,J)
            QIX(I,K)=QIB(I,J,K)*RPSB(I,J)
          END DO
        END DO
      END IF
EXMOISS Optimizations
• Inside the (innermost) miter loop:

      RGV(K) =AMAX1(  RGV(K)/DSIGMA(K),  RGV(K-1)/DSIGMA(K-1) )*DSIGMA(K)
      RGVC(K)=AMAX1( RGVC(K)/DSIGMA(K), RGVC(K-1)/DSIGMA(K-1) )*DSIGMA(K)

• Equivalent to:

      DSRAT(K)=DSIGMA(K)/DSIGMA(K-1)   !! K-only pre-calculation
      .....
      RGV(K) =AMAX1(  RGV(K),  RGV(K-1)*DSRAT(K) )
      RGVC(K)=AMAX1( RGVC(K), RGVC(K-1)*DSRAT(K) )
• Rewrite loop structure and arrays as follows:
– outermost I-loop, enclosing
– sequence of K-loops, then
– miter loop, enclosing internal K-loop
– working arrays subscripted by K only (or scalars, when possible)
EXMOISS Optimizations, cont’d
• Rain-accumulation numerics:
– Original adds one miter-step of one layer of rain to the 2-D array of cumulative rain totals; serious truncation error for long runs.
– Optimized version accumulates the vertical-column advection-step total in a scalar, then adds that scalar to the cumulative total: better round-off, less memory traffic.
• New version is twice as fast, greatly reduced round-off errors
• Generates noticeably different MM5 model results
– no evident bias in the changed results
– caused/amplified by interaction with convective cloud parameterizations?
– See plots to come
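The round-off effect can be demonstrated with a contrived float example (not MM5 code; the magnitudes are deliberately chosen so that single-increment adds are lost completely):

```c
/* Naive scheme: add each small increment straight into a large
   running total.  When the total is much larger than the increment,
   each add can round away entirely. */
float accumulate_naive(float total, int nsteps, float step)
{
    for (int k = 0; k < nsteps; k++)
        total += step;               /* each add may be lost to round-off */
    return total;
}

/* Column-sum scheme (the optimized EXMOISS pattern): sum the column's
   increments in a well-scaled scalar first, then add once. */
float accumulate_column(float total, int nsteps, float step)
{
    float colsum = 0.0f;
    for (int k = 0; k < nsteps; k++)
        colsum += step;              /* subtotal stays well-scaled */
    return total + colsum;           /* one add of a larger value */
}
```

With a total of 2^24 (where float spacing is exactly 1), adding 1.0 a thousand times one at a time changes nothing, while the column subtotal survives intact.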
[Plots: 24-hour forecasts for Days 89-92 and 94-97; x-axis: half-hour time steps 1-49]
• Grid RMS RN (centimeters)
• Grid RMS Difference, KFTOP: Base - Optimized (meters)
• Grid AVERAGE Difference, KFTOP: Base - Optimized (meters)
• Grid MAX Difference, RN: Base - Optimized (centimeters)
• Grid MIN Difference, RN: Base - Optimized (centimeters)
• Grid RMS Difference, RN: Base - Optimized (centimeters)
• Grid MEAN Difference, RN: Base - Optimized (centimeters)
• Grid MAX Difference, RC: Base - Optimized (centimeters)
• Grid RMS Difference, RC: Base - Optimized (centimeters)
• Grid MEAN Difference, RC: Base - Optimized (centimeters)
Other Routines
• Routines:
SOUND, SOLVE3, EXMOISS, GSPBL, LWRAD, MRFPBL, HADV, VADV
• Typical speedup factors for these routines
– 1.1-1.6 on SGI,
– 1.5-2.1 (but 2.54 for GSPBL) on IBM SP
• Frequently, optimized versions have reduced round-off
• Some optimizations will improve both vector and microprocessor performance
• Side effects: reduced cache footprint in EXMOISS, MRFPBL caused 5-8% speedup in SOUND, SOLVE3 on SGI Octane! (less effect on O-2000)
Food for Thought
• What does all this—especially the numerical sensitivities—say for future model formulations such as WRF?
– Double-precision-only model? (and best-available values for physics constants!)
– Ensemble forecasts? (These are very easy to achieve with the current MM5—just multiply some state variable by PSA, then by RPSA! )
– (Most radically) stochastic models that predict cell means and variances instead of deterministic point-values? (Due to theorems in integral operator theory, these have better stability and continuity properties than today’s deterministic models but sub-gridscale processes will be a challenge to formulate!)