Benchmarking NRL’s SGI Altix
Wendell Anderson and Robert Rosenberg
Naval Research Laboratory, Code 5593 (NRL-DC), Washington, DC
{Wendell.Anderson, Robert.Rosenberg}@nrl.navy.mil
Marco Lanzagorta
Scientific & Engineering Solutions, Washington, DC
Abstract
For the past three years, the Naval Research Laboratory (NRL) has been evaluating the Silicon Graphics Inc. (SGI) Altix machine as a platform for high performance computing within the Department of Defense (DoD). This effort has had two parts: first on systems with 1.3 GHz IA-64 processors with 3 MB of cache and a NumaLink3 network architecture, and then on a system with 1.6 GHz IA-64 processors with 9 MB of cache and a NumaLink4 network architecture. As part of this work, NRL has been porting both internally developed codes and the TI-05 benchmark codes to the Altix and evaluating their performance. Performance results are presented for codes run on an Altix system with the 1.3 GHz processors and a RedHat Linux based operating system and on an Altix system with the 1.6 GHz processors and a SUSE Linux based operating system.
1. Introduction
Since December 2002, the US Naval Research Laboratory (NRL) has been evaluating SGI Altix systems under early access programs. The evaluation began with an initial system of 16 900 MHz Itanium II processors and 16 GB of memory and has grown through the two current systems: one with 128 1.3 GHz processors and 256 GB of memory, and the other with 256 1.6 GHz processors and 2 TB of memory. During this period, NRL has always worked at the leading edge of Altix software, currently running ProPack 3 (RedHat Linux based) on the 128 processor system and alternately either ProPack 3.3 (RedHat Linux based) or ProPack 4 (SUSE based) on the 256 processor system.
2. Problems and Methodology
The work discussed in this paper is part of our ongoing effort to benchmark NRL's SGI Altix 3000 computer. In 2004, we reported on some of the computational advantages and disadvantages of this computer, using as a benchmark a suite of scientific codes developed by NRL scientists as part of their research.[1]
However, in order to better evaluate the performance
of the SGI Altix system in a more extensive and
challenging computational environment, a more robust
benchmark was needed. Currently, we are using a
combination of the 2005 Technology Insertion (TI-05)
benchmarks and scientific codes developed at NRL.
2.1. The TI-05 Benchmark
The TI-05 benchmarks are used by the Department of
Defense High Performance Computing Modernization
Program (HPCMP) to ensure that their supercomputing
assets meet the needs of the DoD community that they
serve. In order to meet the growing needs of its users, the
HPCMP needs to continually upgrade its computers to
maintain the current state of the art in computing
resources. Part of this evaluation is a set of application
codes, representing a sample of actual HPC scientific
codes based on expected usage and algorithmic
considerations. We studied six of the eight application
codes from the TI-05 benchmark distribution: GAMESS,
HYCOM, OOCore, Overflow2, RF-CTH2, and WRF.
Together, they cover diverse areas of interest to DoD,
including computational fluid dynamics, computational
chemistry, and weather and ocean modeling.
The General Atomic and Molecular Electronic
Structure System (GAMESS) is a code that simulates
general ab-initio quantum chemistry. Several molecular
properties, such as dipole moments and frequency
dependent hyper-polarizabilities, can be computed with
GAMESS.
HYCOM stands for Hybrid Coordinate Ocean Model; it implements a primitive-equation numerical model of general ocean circulation. This code can be used as a global ocean data assimilation system or as the ocean component of a coupled ocean-atmosphere model.
The out-of-core (OOCore) code uses ScaLAPACK,
BLACS, and BLAS libraries to factor and invert large
matrices and solve large linear systems. As the name
suggests, OOCore is an out-of-core solver using efficient
I/O routines to read and write parts of the matrix to disk.
Overflow2 is a CFD code that performs Euler and
Navier-Stokes calculations for laminar and turbulent
fluids in the vicinity of complex geometrical obstacles.
Overflow2 solves the CFD equations on a set of
overlapping grids, with higher grid-resolution near the
obstacles.
The Reduced Functionality CTH2 (RF-CTH2) code
is, as its name suggests, an incomplete distribution of the
CTH2 code that is used to study the effects of strong
shock waves on a variety of materials using many
different models. RF-CTH2 retains the basic shock
hydrodynamics equations of CTH2, but removes the most
advanced equations describing the behavior of the
material, replacing them with simpler models.
The Weather Research and Forecasting (WRF) code is a
weather modeling system. WRF is an MPI code with
optional OpenMP capabilities, and uses an underlying
communication layer (RSL) to perform the domain
decomposition.
The TI-05 benchmark establishes a standard performance measurement in terms of time-to-solution.
Each test has a “standard” set of test cases, and some also
have a set of “large” test cases for verifying performance
when the application requires a large amount of memory.
For each set of tests, the TI-05 benchmark specifies
several different numbers of processors for running the
application. All of the TI-05 codes described above use
the MPI paradigm for parallelizing the code.
2.2. The NRL-05 Benchmark
Two applications (Causal and NRLMOL) that were
developed at NRL were also chosen for benchmarking the
Altix. Causal was chosen because both OpenMP and MPI versions of the code exist and we wanted to evaluate how an application could benefit from the shared memory available on the Altix. NRLMOL was chosen because it
is a major user of cycles on the 256 processor Altix.
The Causal code[3] models the propagation of an acoustic wave in water via a finite difference time domain (FDTD) representation of the linear wave equation. The dispersive properties of the medium are taken into account by adding the derivative of the convolution between an operator and the acoustic pressure. Causal was originally implemented in FORTRAN 77 with OpenMP directives applied to the loops that update the acoustic wave at each point of a 2D grid over range and depth. The code was converted to MPI by distributing the grid across processors, assigning a set of depths to each processor. The code was then converted back to OpenMP using the MPI distribution scheme[4].
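To illustrate the kind of decomposition just described, the following minimal C/MPI sketch distributes the depth rows of a range-by-depth grid across ranks and exchanges one-row halos each time step. The grid dimensions, array names, and the (omitted) update stencil are placeholders of our own and are not taken from the Causal source.

/* Minimal sketch (not the Causal source): 1-D MPI decomposition of a
 * range x depth FDTD grid by depth, with one-row halo exchange per step.
 * Assumes NDEPTH is divisible by the number of ranks. */
#include <mpi.h>
#include <stdlib.h>

#define NRANGE 1024   /* hypothetical grid sizes */
#define NDEPTH 1024

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank owns a contiguous block of depth rows plus two halo rows. */
    int ndep_local = NDEPTH / nprocs;
    double *p = calloc((size_t)(ndep_local + 2) * NRANGE, sizeof *p);

    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 100; step++) {
        /* Send first owned row up, receive bottom halo from below. */
        MPI_Sendrecv(&p[1 * NRANGE], NRANGE, MPI_DOUBLE, up, 0,
                     &p[(ndep_local + 1) * NRANGE], NRANGE, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Send last owned row down, receive top halo from above. */
        MPI_Sendrecv(&p[ndep_local * NRANGE], NRANGE, MPI_DOUBLE, down, 1,
                     &p[0 * NRANGE], NRANGE, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* FDTD update of the interior points would go here (placeholder). */
    }

    free(p);
    MPI_Finalize();
    return 0;
}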
The NRLMOL code[2,5] implements the Density-
Functional formalism for clusters and molecules.
NRLMOL is a FORTRAN code that uses MPI to
parallelize the problem by using a master process to
distribute work to slaves. All parallelism is carried out
through this master/slave relationship.
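The master/slave pattern described above can be sketched as follows in C with MPI; the task count, message tags, and placeholder computation are our own illustration and do not come from the NRLMOL source.

/* Minimal sketch (not the NRLMOL source): rank 0 hands out task indices
 * on demand; slaves compute and return a result until told to stop. */
#include <mpi.h>

#define NTASKS 1000   /* hypothetical number of independent work units */
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                       /* master */
        int next = 0, active = 0;
        double result;
        MPI_Status st;
        /* Prime every slave: real work if any remains, otherwise stop. */
        for (int r = 1; r < nprocs; r++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, r, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 1, MPI_INT, r, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                               /* slave */
        int task;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double result = (double)task;  /* placeholder for real work */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}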
3. Results
The benchmarks on the Altix were run with two goals in mind: first, to examine how the applications scaled as more processors were devoted to the calculation, and second, to evaluate the performance improvement resulting from the upgrade of the Altix from 1.3 GHz processors to 1.6 GHz processors.
3.1. TI-05 Application Codes
In Reference 1 we discussed some algorithmic and code optimizations applied to NRL's benchmark suite to increase its performance on the SGI Altix 3000 computer. For the TI-05 benchmark codes, however, we did not change the original codes, except for a few modifications required in a couple of cases to make the code run.
Overall, compiling and linking most of the TI-05 benchmark codes on the SGI Altix 3000 was straightforward. Even so, for each case we had to select appropriate compiler flags for optimal performance. In particular, we modified the TI-05 makefiles to include an Altix option with specific compiler flags, such as -ia64 (the Altix's native processor) and -convert big-endian (needed because the Itanium II based Altix reads files in the little-endian convention by default).
The timing experiments were performed on a “clean
system”. That is, the entire Altix computer was reserved
for the benchmark runs, but only the top processors were
dedicated to the computations. This strategy prevented
other users from running simultaneously with our runs,
and therefore avoided the introduction of unknown loads
on the system. By choosing the top processors for each
set of runs we were able to minimize the length of paths
for inter-processor communications.
Figures 1 through 6 summarize the relative speed of the Altix for six of the TI-05 application codes. The data have been scaled so that 1 represents the wall clock time for the application running on the smallest number of processors on the 1.3 GHz system. In each case, we present curves for perfect scaling, the 1.3 GHz timings, and the 1.6 GHz timings.
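As a guide to reading these curves, the plotted quantity (labeled 1/scaling in the figures) can be expressed as a relative speed. The following restates the normalization described above in equation form; it is our reading of the figures, not a formula taken from the benchmark definition:

\[
\text{relative speed}(p, f) = \frac{T_{1.3\,\mathrm{GHz}}(p_{\min})}{T_{f}(p)},
\qquad
\text{perfect scaling}(p) = \frac{p}{p_{\min}},
\]

where \(T_{f}(p)\) is the wall clock time on \(p\) processors at clock frequency \(f\), and \(p_{\min}\) is the smallest processor count run for that application.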
Figure 1. GAMESS scaling (1/scaling vs. number of processors; perfect scaling, 1.3 GHz Altix, and 1.6 GHz Altix curves).
Figure 2. HYCOM scaling (1/scaling vs. number of processors; perfect scaling, 1.3 GHz Altix, and 1.6 GHz Altix curves).
Figure 3. OOCore scaling (1/scaling vs. number of processors; perfect scaling, 1.3 GHz Altix, and 1.6 GHz Altix curves).
Figure 4. Overflow2 scaling (1/scaling vs. number of processors; perfect scaling, 1.3 GHz Altix, and 1.6 GHz Altix curves).
Figure 5. RF-CTH2 scaling (1/scaling vs. number of processors; perfect scaling and 1.3 GHz Altix curves).
Figure 6. WRF scaling (1/scaling vs. number of processors; perfect scaling, 1.3 GHz Altix, and 1.6 GHz Altix curves).
The graphs show that, on the 1.3 GHz machine, the six TI-05 applications scaled up to 64 processors, and HYCOM showed good scaling up to the largest case run (124 processors). The same held for the 1.6 GHz machine, with the additional observation that WRF scaled better on the new processors. This may be due to the larger cache available on the 1.6 GHz processors.
3.2. NRL-05 Application Codes
In this section we discuss the performance of the
Causal and NRLMOL codes. Table 1 reports the scaling
obtained for the NRLMOL code running on the Altix with
the 1.3 GHz processors, and Table 2 presents the scaling
using the 1.6 GHz configuration. The problem chosen for
the benchmark was the light-emitting molecule described
in Reference 2.
These two tables report not only the total wall clock time of the whole program, scaled to the 50-processor 1.3 GHz case (Total), but also partial timings for selected subroutines. These are Mesh (mesh-related operations such as reading, testing, and writing, executed once) and three routines that are executed several times: Pot (construction of the potential matrix elements), Wave (construction of new wave functions), and Apot (the complete apotnl execution).
Times for the last three subroutines are given on a per-iteration basis. While overall scaling is poor, parts of the code (Pot and Apot) scale very well. The real problem with scaling the code is Wave, and it is in this area where further work is needed to improve the parallelization of the code.
Table 1. NRLMOL 1.3 GHz Altix scaling
Processors Mesh Pot Wave Apot Total
50 0.122 0.036 0.091 0.084 1.000
100 0.107 0.021 0.079 0.046 0.737
150 0.101 0.015 0.073 0.033 0.647
200 0.098 0.011 0.070 0.027 0.602
250 0.096 0.009 0.068 0.023 0.574
Table 2. NRLMOL 1.6 GHz Altix scaling
Processors Mesh Pot Wave Apot Total
50 0.109 0.030 0.087 0.071 0.887
100 0.096 0.017 0.074 0.039 0.655
150 0.091 0.012 0.069 0.028 0.579
200 0.088 0.009 0.066 0.023 0.540
250 0.086 0.008 0.064 0.020 0.516
Figure 7 presents our results for the Causal code.
Figure 7. Causal scaling (1/scaling vs. number of processors; perfect scaling and Causal scaling curves).
Finally, Table 3 shows the wall clock times for each application code, comparing the 1.3 GHz processors against the 1.6 GHz processors. For each application code, the time to run on the greatest number of processors was chosen. All of the applications showed at least a 10% improvement, with two applications, HYCOM (23%) and WRF (50%), showing improvements greater than the 19% expected from the increased clock rate alone.
Table 3. 1.3 GHz processor vs. 1.6 GHz processor
1.3 GHz 1.6 GHz Improvement
GAMESS 0.45 0.40 11%
HYCOM 0.22 0.17 23%
OOCore 0.32 0.27 16%
Overflow2 0.28 0.24 14%
RFCTH2 0.34 0.29 15%
WRF 0.46 0.23 50%
NRLMOL 0.57 0.52 10%
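The improvement column in Table 3 is consistent with the fractional reduction in scaled wall clock time, and the 19% baseline follows from the clock ratio alone. A quick check (our reading of the table, not a formula from the benchmark definition):

\[
\text{improvement} = 1 - \frac{t_{1.6}}{t_{1.3}},
\qquad
1 - \frac{1.3}{1.6} \approx 19\%,
\qquad
\text{e.g. HYCOM: } 1 - \frac{0.17}{0.22} \approx 23\%.
\]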
The TI-05 application codes used in the evaluation of the Altix come from a broad spectrum of the HPC Computational Technology Areas (CTAs), including Computational Chemistry and Materials Science (CCM), Computational Electromagnetics and Acoustics (CEA), Computational Fluid Dynamics (CFD), Computational Structural Mechanics (CSM), and Climate/Weather/Ocean Modeling and Simulation (CWO).
4. Significance to DoD

As the requirements for high performance computing resources (computation, memory, and I/O) continue to grow, and new resources become available in the marketplace, the actual benefits to existing DoD and Navy codes need to be examined. The TI-05 and NRL-05 application benchmarks run on the NRL Altix 3000 computer provide valuable insights into the advantages of a balanced architecture of processor speed and memory. Based on these results, we expect that, in the near future, the Altix architecture will be relevant for those problems that need to achieve T3 (a teraflop of computation, a terabyte of memory, and a terabit per second of I/O) performance.

Systems Used

Computations were performed on resources provided by the DoD HPCMP. Calculations were performed on the SGI Altix 3000 computers at the U.S. Naval Research Laboratory (NRL) shared resource center.

References

1. Anderson, W., R. Rosenberg, and M. Lanzagorta, "Early Performance Results on the NRL SGI Altix 3000 Computer." Proceedings of the 2004 DoD HPCMP Users Group Conference, Williamsburg, VA, June 7–11, 2004.

2. Baruah, T., "Massively parallel simulation of light harvesting in an organic molecular triad." To be presented at the DoD HPCMP Users Group Meeting, 2005.

3. Norton, G. and J. Novarini, "Including dispersion and attenuation directly in the time domain for wave propagation in isotropic media." J. Acoust. Soc. Am., 113, 2003, p. 3024.

4. Norton, G., W. Anderson, J. Novarini, R. Rosenberg, and M. Lanzagorta, "Modeling pulse propagation and scattering in a dispersive medium." To be presented at the DoD HPCMP Users Group Meeting, 2005.

5. Pederson, M., D. Porezag, J. Kortus, and D. Patton, "Strategies for massively parallel local-orbital-based electronic structure calculations." Physica Status Solidi B, 217, 2000, p. 197.