Benchmarking NRL’s SGI Altix
Wendell Anderson and Robert Rosenberg
Naval Research Laboratory, Code 5593 (NRL-DC), Washington, DC
{Wendell.Anderson, Robert.Rosenberg}@nrl.navy.mil
Marco Lanzagorta
Scientific & Engineering Solutions, Washington, DC
Abstract
For the past three years, the Naval Research Laboratory (NRL) has been evaluating the Silicon Graphics Inc. (SGI) Altix machine as a platform for high performance computing within the Department of Defense (DoD). This effort has had two parts: first on systems with 1.3 GHz IA-64 processors with 3 MB of cache and a NumaLink3 network architecture, and then on a system with 1.6 GHz IA-64 processors with 9 MB of cache and a NumaLink4 network architecture. As part of this work, NRL has been porting both internally developed codes and the TI-05 benchmark codes to the Altix and evaluating their performance. Performance results are presented for codes run on an Altix system with the 1.3 GHz processors and a RedHat Linux based operating system and on an Altix system with the 1.6 GHz processors and a SUSE Linux based operating system.
1. Introduction
Since December 2002, the US Naval Research Laboratory (NRL) has been evaluating SGI Altix systems under early access programs. The evaluation began with an initial system of 16 900 MHz Itanium II processors and 16 GB of memory and has grown through the two current systems: one with 128 1.3 GHz processors and 256 GB of memory, and the other with 256 1.6 GHz processors and 2 TB of memory. During this period, NRL has always worked at the leading edge of Altix software, currently running ProPack 3 (RedHat Linux based) on the 128 processor system and alternately either ProPack 3.3 (RedHat Linux based) or ProPack 4 (SUSE based) on the 256 processor system.
2. Problems and Methodology
The work discussed in this paper is part of our ongoing effort to benchmark NRL's SGI Altix 3000 computer. In 2004, we reported on some of the computational advantages and disadvantages of this computer, using as a benchmark a suite of scientific codes developed by NRL scientists as part of their research.[1]
However, in order to better evaluate the performance
of the SGI Altix system in a more extensive and
challenging computational environment, a more robust
benchmark was needed. Currently, we are using a
combination of the 2005 Technology Insertion (TI-05)
benchmarks and scientific codes developed at NRL.
2.1. The TI-05 Benchmark
The TI-05 benchmarks are used by the Department of
Defense High Performance Computing Modernization
Program (HPCMP) to ensure that their supercomputing
assets meet the needs of the DoD community that they
serve. In order to meet the growing needs of its users, the
HPCMP needs to continually upgrade its computers to
maintain the current state of the art in computing
resources. Part of this evaluation is a set of application
codes, representing a sample of actual HPC scientific
codes based on expected usage and algorithmic
considerations. We studied six of the eight application
codes from the TI-05 benchmark distribution: GAMESS,
HYCOM, OOCore, Overflow2, RF-CTH2, and WRF.
Together, they cover diverse areas of interest to DoD,
including computational fluid dynamics, computational
chemistry, and weather and ocean modeling.
The General Atomic and Molecular Electronic
Structure System (GAMESS) is a code that simulates
general ab-initio quantum chemistry. Several molecular
properties, such as dipole moments and frequency
dependent hyper-polarizabilities, can be computed with
GAMESS.
HYCOM stands for Hybrid Coordinate Ocean Model; it implements a primitive-equation numerical model of general ocean circulation. This code can be used as a global ocean data assimilation system or as the ocean component of a coupled ocean-atmosphere model.
The out-of-core (OOCore) code uses ScaLAPACK,
BLACS, and BLAS libraries to factor and invert large
matrices and solve large linear systems. As the name
suggests, OOCore is an out-of-core solver using efficient
I/O routines to read and write parts of the matrix to disk.
Overflow2 is a CFD code that performs Euler and
Navier-Stokes calculations for laminar and turbulent
fluids in the vicinity of complex geometrical obstacles.
Overflow2 solves the CFD equations on a set of
overlapping grids, with higher grid-resolution near the
obstacles.
The Reduced Functionality CTH2 (RF-CTH2) code
is, as its name suggests, an incomplete distribution of the
CTH2 code that is used to study the effects of strong
shock waves on a variety of materials using many
different models. RF-CTH2 retains the basic shock
hydrodynamics equations of CTH2, but removes the most
advanced equations describing the behavior of the
material, replacing them with simpler models.
The Weather Research and Forecasting (WRF) code is a
weather modeling system. WRF is an MPI code with
optional OpenMP capabilities, and uses an underlying
communication layer (RSL) to perform the domain
decomposition.
The TI-05 benchmark establishes a standard performance measurement in terms of time-to-solution.
Each test has a “standard” set of test cases, and some also
have a set of “large” test cases for verifying performance
when the application requires a large amount of memory.
For each set of tests, the TI-05 benchmark specifies
several different numbers of processors for running the
application. All of the TI-05 codes described above use
the MPI paradigm for parallelizing the code.
2.2. The NRL-05 Benchmark
Two applications (Causal and NRLMOL) that were
developed at NRL were also chosen for benchmarking the
Altix. Causal was chosen because both OpenMP and MPI versions of the code exist and we wanted to evaluate how an application could benefit from the shared memory available on the Altix. NRLMOL was chosen because it
is a major user of cycles on the 256 processor Altix.
The Causal code[3] models the propagation of an acoustic wave in water via a finite difference time domain (FDTD) representation of the linear wave equation. The dispersive properties of the medium are taken into account by adding the derivative of the convolution between an operator and the acoustic pressure. Causal was originally implemented in FORTRAN 77 with OpenMP directives applied to the loops that update the acoustic wave at each point of a 2D grid over range and depth. The code was converted to MPI by distributing the grid across processors, assigning a set of depths to each processor. The code was then converted back to OpenMP using the MPI distribution scheme[4].
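To illustrate the kind of decomposition just described, the following minimal C/MPI sketch distributes the depth rows of a range-by-depth grid across ranks and exchanges one-row halos each time step. The grid dimensions, array names, and the (omitted) update stencil are placeholders of our own and are not taken from the Causal source.

/* Minimal sketch (not the Causal source): 1-D MPI decomposition of a
 * range x depth FDTD grid by depth, with one-row halo exchange per step.
 * Assumes NDEPTH is divisible by the number of ranks. */
#include <mpi.h>
#include <stdlib.h>

#define NRANGE 1024   /* hypothetical grid sizes */
#define NDEPTH 1024

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank owns a contiguous block of depth rows plus two halo rows. */
    int ndep_local = NDEPTH / nprocs;
    double *p = calloc((size_t)(ndep_local + 2) * NRANGE, sizeof *p);

    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 100; step++) {
        /* Send first owned row up, receive bottom halo from below. */
        MPI_Sendrecv(&p[1 * NRANGE], NRANGE, MPI_DOUBLE, up, 0,
                     &p[(ndep_local + 1) * NRANGE], NRANGE, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Send last owned row down, receive top halo from above. */
        MPI_Sendrecv(&p[ndep_local * NRANGE], NRANGE, MPI_DOUBLE, down, 1,
                     &p[0 * NRANGE], NRANGE, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* FDTD update of the interior points would go here (placeholder). */
    }

    free(p);
    MPI_Finalize();
    return 0;
}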
The NRLMOL code[2,5] implements the Density-
Functional formalism for clusters and molecules.
NRLMOL is a FORTRAN code that uses MPI to
parallelize the problem by using a master process to
distribute work to slaves. All parallelism is carried out
through this master/slave relationship.
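The master/slave pattern described above can be sketched as follows in C with MPI; the task count, message tags, and placeholder computation are our own illustration and do not come from the NRLMOL source.

/* Minimal sketch (not the NRLMOL source): rank 0 hands out task indices
 * on demand; slaves compute and return a result until told to stop. */
#include <mpi.h>

#define NTASKS 1000   /* hypothetical number of independent work units */
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                       /* master */
        int next = 0, active = 0;
        double result;
        MPI_Status st;
        /* Prime every slave: real work if any remains, otherwise stop. */
        for (int r = 1; r < nprocs; r++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, r, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 1, MPI_INT, r, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                               /* slave */
        int task;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double result = (double)task;  /* placeholder for real work */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}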
3. Results
The benchmarks on the Altix were run with two goals in mind: first, to examine how the applications scaled as more processors were devoted to the calculation, and second, to evaluate the performance improvement resulting from the upgrade of the Altix from 1.3 GHz processors to 1.6 GHz processors.
3.1. TI-05 Application Codes
In Reference 1 we discussed some algorithmic and code optimizations applied to NRL's benchmark suite to increase its performance on the SGI Altix 3000 computer. For the TI-05 benchmark codes, however, we did not change the original codes, except for a few modifications required in a couple of cases to make the code run.
Overall, compiling and linking most of the TI-05 benchmark codes on the SGI Altix 3000 was straightforward. Even so, for each case we had to select appropriate compiler flags for optimal performance. In particular, we modified the TI-05 makefiles to include an Altix option with specific compiler flags, such as -ia64 (the Altix's native processor) and -convert big-endian (needed because the Itanium II based Altix reads files in the little-endian convention by default).
The timing experiments were performed on a “clean
system”. That is, the entire Altix computer was reserved
for the benchmark runs, but only the top processors were
dedicated to the computations. This strategy prevented
other users from running simultaneously with our runs,
and therefore avoided the introduction of unknown loads
on the system. By choosing the top processors for each
set of runs we were able to minimize the length of paths
for inter-processor communications.
Figures 1 through 6 summarize the relative speed of the Altix for six of the TI-05 application codes. The data have been scaled so that 1 represents the wall clock time for the application running on the smallest number of processors on the 1.3 GHz system. In each case, we present curves for perfect scaling, the 1.3 GHz timings, and the 1.6 GHz timings.
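As a guide to reading these curves, the plotted quantity (labeled 1/scaling in the figures) can be expressed as a relative speed. The following restates the normalization described above in equation form; it is our reading of the figures, not a formula taken from the benchmark definition:

\[
\text{relative speed}(p, f) = \frac{T_{1.3\,\mathrm{GHz}}(p_{\min})}{T_{f}(p)},
\qquad
\text{perfect scaling}(p) = \frac{p}{p_{\min}},
\]

where \(T_{f}(p)\) is the wall clock time on \(p\) processors at clock frequency \(f\), and \(p_{\min}\) is the smallest processor count run for that application.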
Figure 1. GAMESS scaling (1/scaling vs. number of processors; perfect scaling, 1.3 GHz Altix, and 1.6 GHz Altix curves).
Figure 2. HYCOM scaling (1/scaling vs. number of processors; perfect scaling, 1.3 GHz Altix, and 1.6 GHz Altix curves).
Figure 3. OOCore scaling (1/scaling vs. number of processors; perfect scaling, 1.3 GHz Altix, and 1.6 GHz Altix curves).
Figure 4. Overflow2 scaling (1/scaling vs. number of processors; perfect scaling, 1.3 GHz Altix, and 1.6 GHz Altix curves).
Figure 5. RF-CTH2 scaling (1/scaling vs. number of processors; perfect scaling and 1.3 GHz Altix curves).
Figure 6. WRF scaling (1/scaling vs. number of processors; perfect scaling, 1.3 GHz Altix, and 1.6 GHz Altix curves).
The graphs show that, on the 1.3 GHz machine, the six TI-05 applications scaled up to 64 processors, and HYCOM showed good scaling up to the largest case run (124 processors). The same held for the 1.6 GHz machine, with the additional observation that WRF scaled better on the new processors. This may be due to the larger cache available on the 1.6 GHz processors.
3.2. NRL-05 Application Codes
In this section we discuss the performance of the
Causal and NRLMOL codes. Table 1 reports the scaling
obtained for the NRLMOL code running on the Altix with
the 1.3 GHz processors, and Table 2 presents the scaling
using the 1.6 GHz configuration. The problem chosen for
the benchmark was the light-emitting molecule described
in Reference 2.
These two tables report not only the total wall clock time of the whole program, scaled to the 50-processor 1.3 GHz case (Total), but also partial timings for selected subroutines. These are Mesh (mesh-related operations such as reading, testing, and writing, executed once) and three routines that are executed several times: Pot (construction of the potential matrix elements), Wave (construction of new wave functions), and Apot (the complete apotnl execution).
Times for the last three subroutines are given on a per-iteration basis. While overall scaling is poor, parts of the code (Pot and Apot) scale very well. The real problem with scaling the code is Wave, and it is in this area where further work is needed to improve the parallelization of the code.
Table 1. NRLMOL 1.3 GHz Altix scaling
Processors Mesh Pot Wave Apot Total
50 0.122 0.036 0.091 0.084 1.000
100 0.107 0.021 0.079 0.046 0.737
150 0.101 0.015 0.073 0.033 0.647
200 0.098 0.011 0.070 0.027 0.602
250 0.096 0.009 0.068 0.023 0.574
Table 2. NRLMOL 1.6 GHz Altix scaling
Processors Mesh Pot Wave Apot Total
50 0.109 0.030 0.087 0.071 0.887
100 0.096 0.017 0.074 0.039 0.655
150 0.091 0.012 0.069 0.028 0.579
200 0.088 0.009 0.066 0.023 0.540
250 0.086 0.008 0.064 0.020 0.516
Figure 7 presents our results for the Causal code.
Figure 7. Causal scaling (1/scaling vs. number of processors; perfect scaling and Causal scaling curves).
Finally, Table 3 shows the wall clock times for each application code, comparing the 1.3 GHz processors against the 1.6 GHz processors. For each application code, the time to run on the greatest number of processors was chosen. All of the applications showed at least a 10% improvement, with two applications, HYCOM (23%) and WRF (50%), showing improvements greater than the 19% expected from the increased clock rate alone.
Table 3. 1.3 GHz processor vs. 1.6 GHz processor
1.3 GHz 1.6 GHz Improvement
GAMESS 0.45 0.40 11%
HYCOM 0.22 0.17 23%
OOCore 0.32 0.27 16%
Overflow2 0.28 0.24 14%
RFCTH2 0.34 0.29 15%
WRF 0.46 0.23 50%
NRLMOL 0.57 0.52 10%
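The improvement column in Table 3 is consistent with the fractional reduction in scaled wall clock time, and the 19% baseline follows from the clock ratio alone. A quick check (our reading of the table, not a formula from the benchmark definition):

\[
\text{improvement} = 1 - \frac{t_{1.6}}{t_{1.3}},
\qquad
1 - \frac{1.3}{1.6} \approx 19\%,
\qquad
\text{e.g. HYCOM: } 1 - \frac{0.17}{0.22} \approx 23\%.
\]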
The TI-05 application codes used in the evaluation of the Altix come from a broad spectrum of the HPC Computational Technology Areas (CTAs), including Computational Chemistry and Materials Science (CCM), Computational Electromagnetics and Acoustics (CEA), Computational Fluid Dynamics (CFD), Computational Structural Mechanics (CSM), and Climate/Weather/Ocean Modeling and Simulation (CWO).
4. Significance to DoD

As the requirements for high performance computing resources (computation, memory, and I/O) continue to grow, and new resources become available in the marketplace, the actual benefits to existing DoD and Navy codes need to be examined. The TI-05 and NRL-05 application benchmarks run on the NRL Altix 3000 computer provide valuable insights into the advantages of a balanced architecture of processor speed and memory. Based on these results, we expect that, in the near future, the Altix architecture will be relevant for those problems that need to achieve T3 (a teraflop of computation, a terabyte of memory, and a terabit per second of I/O) performance.

Systems Used

Computations were performed on resources provided by the DoD HPCMP. Calculations were performed on the SGI Altix 3000 computers at the U.S. Naval Research Laboratory (NRL) shared resource center.

References

1. Anderson, W., R. Rosenberg, and M. Lanzagorta, "Early Performance Results on the NRL SGI Altix 3000 Computer." Proceedings of the 2004 DoD HPCMP Users Group Conference, Williamsburg, VA, June 7–11, 2004.

2. Baruah, T., "Massively parallel simulation of light harvesting in an organic molecular triad." To be presented at the DoD HPCMP Users Group Meeting, 2005.

3. Norton, G. and J. Novarini, "Including dispersion and attenuation directly in the time domain for wave propagation in isotropic media." J. Acoust. Soc. Am., 113, 2003, p. 3024.

4. Norton, G., W. Anderson, J. Novarini, R. Rosenberg, and M. Lanzagorta, "Modeling pulse propagation and scattering in a dispersive medium." To be presented at the DoD HPCMP Users Group Meeting, 2005.

5. Pederson, M., D. Porezag, J. Kortus, and D. Patton, "Strategies for massively parallel local-orbital-based electronic structure calculations." Physica Status Solidi B, 217, 2000, p. 197.