
Applied Numerical Mathematics 30 (1999) 125–135

High-performance computing at Silicon Graphics/Cray Research

Gerardo Cisneros a,∗, Jeff P. Brooks b

a Silicon Graphics, S.A. de C.V., Av. Vasco de Quiroga 3000, Col. Santa Fe, 01210 Mexico, DF, Mexico
b Silicon Graphics, Inc., 655F Lone Oak Drive, Eagan, MN 55121-1560, USA

∗ Corresponding author. E-mail: [email protected].

Abstract

In this paper quadrants characterizing typical computational kernels are used to discuss the performance of parallel–vector, distributed-memory massively parallel, and distributed-shared memory parallel supercomputers, and how systems of such different architectures can satisfy the requirements of different application and customer types. © 1999 Elsevier Science B.V. and IMACS. All rights reserved.

Keywords: Supercomputers; Performance; Applications

1. Introduction

Supercomputers are usually defined as being the computer systems that can solve the largest problems in the least amount of time. This, of course, is a time-dependent definition, and throughout the years different systems have come to be known as supercomputers.

Although the term “supercomputer” did not arise until the 1970s (see [11] for an early paper with the word in its title), the term is now applied on occasion to earlier systems that were noteworthy as leading and capable scientific computers, such as the CDC 6600 and 7600.

Since introducing the CRAY-1 in 1976, Cray Research, which is now a subsidiary of Silicon Graphics, Inc. (SGI), has been the leading supplier of supercomputers. Currently, SGI/CRAY has four product lines:

(1) CRAY T90. This is the fifth-generation vector supercomputer. It is a shared-memory parallel system with models supporting up to 4 processors (T94), each capable of up to 1.8 × 10^9 floating point operations per second (1.8 Gflop/s), and models supporting up to 16 (T916) and 32 (T932) processors, each capable of up to 1.76 Gflop/s. On all CRAY T90 systems the memory subsystem can sustain data transfers in excess of 21 Gbyte/s per CPU.

(2) CRAY J90. This is a low-cost version of the third-generation parallel–vector CRAY Y-MP, with scalar performance enhancements. Available models support up to 8, 16 or 32 processors, each with a peak performance rating of 200 × 10^6 flop/s (200 Mflop/s). A processor for the CRAY J90 with a substantial performance increase is expected to be available in late 1998.

(3) CRAY T3E. This is the second-generation CRAY massively parallel system, which is based on Digital’s Alpha EV5.6 microprocessor (300, 450 and 600 MHz). Configurations range from 10 to 2048 processors; installed systems average near 200. It is a distributed-memory system with a 3-dimensional torus communications network.

(4) Origin 2000. This is the newest member of the SGI/CRAY line, a distributed memory system running on a shared memory, single operating system image model. The Origin 2000 is based on the MIPS R10000 RISC (180, 195 or 250 MHz) microprocessor, with a large secondary cache (1 Mbyte or 4 Mbyte). The communications network has a hypercube topology built on 6-port routers. Systems are available in configurations with 2 to 128 processors.

The question arises, which of these four systems is appropriate for a given customer’s computational problems? In this paper we attempt to answer this question by addressing each architecture’s distinguishing strengths. In Section 2 we discuss memory bandwidth and latency issues and some approaches that have been used to minimize the effects of latency. Locality and vectorizability are explained in Section 3 and used in Section 4 to classify typical computational kernels and exhibit performance comparisons among various systems. A summary and conclusions are given in Section 5.

2. Memory latency and bandwidth

Memory latency is the time required for the memory subsystem to deliver an operand to a processor from the time it is first requested. Typical latencies range from a few tens of nanoseconds (ns) for (the expensive) static random access memories (SRAMs) to several hundreds of ns for (the inexpensive) dynamic RAMs (DRAMs). Current peak processing speeds for both microprocessors and vector processors range between 200 and 2000 Mflop/s, which means that one or two operands must be delivered to the processor every few ns. Given SRAM and DRAM latencies, memory latency must therefore be dealt with if sustained processing speeds are to approach substantial fractions of the peak processing speeds.

Memory bandwidth is the asymptotic rate at which the memory subsystem can move data in and out of the processors. Typical bandwidth on current supercomputers ranges from a few hundred Mbyte/s to tens of Gbyte/s per processor [7].

Over the past 30 years, the true memory latency of high-end computer systems has changed little, while processor performance has increased by several orders of magnitude [6]. Computer architects have responded to this challenge by devising memory hierarchies of increasing complexity, and by providing features with lower latency for certain memory access patterns. Compiler writers have increased the depth to which code segments are evaluated in order to find more operations for hiding latency. Application developers have turned to new algorithms that better exploit locality of data access. While many applications have benefited from one or more of these innovations, the challenge of hiding memory latency and achieving a significant fraction of the peak speed on scientific and engineering codes continues to grow. In effect, it can be argued that the main problem to be solved in designing a high-performance computing system is not one of building fast processors, but keeping them busy.

Several approaches have been used by SGI/CRAY to hide memory latency:

(1) Vector registers and instructions: The memory is divided into a large number of banks, say n, and addressing is interleaved, so that any given n consecutive storage locations each belong to a different bank. The memory subsystem supports high data rates, so that the latency of the second, third, etc. accesses in a vector load or store is hidden behind the latency of each preceding access. This works even for non-stride-1 (nonconsecutive) accesses, as long as the access stride is not a multiple of n, and is best when the stride is relatively prime to n (see the sketch following this list). The downside is that it works well only for applications whose data can be processed as vectors, i.e., vectorizable codes.

(2) Caches: A small portion of SRAM is built into the processor or placed next to it. Current microprocessors support both internal (primary or “L1”) and external (secondary or “L2”) caches; typical sizes are a few tens of Kbyte for internal caches and from a few hundred Kbyte to a few Mbyte for external caches. Caches are structured as sets of cache lines, into which a few tens of bytes at a time are moved in from or written back to memory. Longer cache lines look somewhat like stride-1 vectors. Typical latencies are a few ns for reads from the L1 cache and a few tens of ns for reads from the L2 cache.

(3) Streaming memory: A “smart” memory system that detects certain access patterns and prefetches operands from DRAM into fast buffers once a stream is detected. This is used in the CRAY T3E instead of secondary caches.

(4) E-registers: Another memory system enhancement used in the CRAY T3E to get around the stride-1 dependence of microprocessor caches; E-registers enhance performance for gather/scatter and non-stride-1 references.
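To make the interleaving in item (1) concrete, the following is a minimal sketch of ours (not taken from the paper); the bank count of 8 is hypothetical. With n banks, word j of a vector resides in bank mod(j, n), so a stride that is a multiple of n lands every reference in the same bank, while a stride relatively prime to n cycles through all of them:

      program bankdemo
      implicit none
      integer, parameter :: nbanks = 8          ! hypothetical number of memory banks
      integer :: stride, i
      do stride = 1, nbanks
         write(*,'(a,i2,a)', advance='no') 'stride', stride, ': banks'
         do i = 0, nbanks-1
            ! bank addressed by the i-th element of a strided vector access
            write(*,'(1x,i1)', advance='no') mod(i*stride, nbanks)
         enddo
         write(*,*)
      enddo
      end program bankdemo

With 8 banks, a stride of 8 maps every reference to bank 0 and the accesses serialize, whereas odd strides visit all 8 banks and the latencies overlap.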

In the remainder of the paper, our comparisons include systems produced by SGI/CRAY and do not consider address randomization and multithreading.

A sampling of past and present high-performance computing systems, shown in Table 1, tells the latency story.

Table 1
Historical memory latencies

System         Memory latency,   Clock period,   ML/CP   FP ops   FP ops to
               ML (ns)           CP (ns)         ratio   per CP   cover ML
CDC 7600        220              27.5              8       1         8
CRAY-1          150              12.5             12       2        24
CRAY X-MP       120               8.5             14       2        28
CRAY Y-MP       102               6.0             17       2        34
CRAY C90         96               4.17            23       4        46
SGI PwrCh      ∼760              13.3             57       4       228
CRAY J90        340              10.0             34       2        68
CRAY T94         76               2.22            34       4       136
CRAY T932       114               2.22            51       4       204
CRAY T3E-900   ∼280               2.2            126       2       252
Origin 2000    ∼313               5.1             61       2       122
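The last two columns follow from the first three (an arithmetic note added here, not additional data): the number of floating point operations needed to cover one memory latency is the ML/CP ratio multiplied by the floating point operations issued per clock period. For the CRAY T94, for instance, 76 ns / 2.22 ns ≈ 34 clock periods of latency, and at 4 floating point operations per clock the processor must find 34 × 4 = 136 independent operations to stay busy while a single operand is fetched from memory.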


Several of the systems in Table 1 introduced new techniques to effectively reduce or hide memory latency. The CDC 7600 decoupled loads and stores from a set of pipelined functional units. The CRAY-1 employed vector registers and a pipelined memory system. The CRAY X-MP added gather/scatter hardware to the basic CRAY-1 design to handle irregular memory access patterns. The SGI Power Challenge employed a large (512 kB) streaming board cache which could effectively pipeline memory operations at high bandwidth. The CRAY T3E has a smaller cache but adds stream buffers for local data that are accessed with a small or unit stride and E-registers for data accesses that should bypass the cache.

All of the machines discussed in this paper handle latency tolerance with differing mechanisms. Consequently, the realized performance of C and Fortran programs is in large part due to how the compilers map such code onto these hardware features.

3. Vectorizability and locality

A loop is said to be vectorizable if it can be executed using vector registers in such a way that the results produced are the same as those that would be produced by execution using scalar registers [2]. Codes containing few or no vectorizable loops are called scalar codes. Compiler technology to recognize and exploit vector capabilities has matured significantly over the past 20 years.
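As an illustration (a sketch of ours, not an example from [2]), the first loop below has no loop-carried dependence and can be executed in vector registers, while the second carries a recurrence on a(i-1) and must run in scalar mode; the array length n is arbitrary:

      program vecdemo
      implicit none
      integer, parameter :: n = 1000
      real :: a(n), b(n), d(n), c
      integer :: i
      b = 1.0; d = 2.0; c = 3.0
      ! Independent iterations: no result depends on an earlier iteration,
      ! so the loop can be executed with vector instructions.
      do i = 1, n
         a(i) = b(i) + c*d(i)
      enddo
      ! Recurrence: each iteration needs the a(i-1) just computed, so vector
      ! execution would change the results; this loop stays scalar.
      do i = 2, n
         a(i) = a(i-1) + b(i)
      enddo
      print *, a(n)
      end program vecdemo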

The term locality is used to denote a property of a program’s memory access pattern. A section of code has spatial locality if most of its memory references are made to a small set of closely spaced addresses. For example, the loop

      do i = 1,n
         a(i) = b(i) + c*d(i)
      enddo

has good spatial locality across iterations for each of the arrays it references. This code works well on vector architectures for large enough n. Note that vector overhead can be amortized with as few as three iterations, depending on the particular architecture. The loop also works well on cache-based processors, especially if the vectors fit in the cache and are already in it when the loop starts executing.

Temporal locality is observed in code that accesses the same memory locations multiple times. It is exploited by the hardware if the data items remain in the cache between the succeeding references. For example, consider the following loop:

      do j = 2,n-1
         do i = 2,n-1
            a(i,j) = x(i-1,j-1) + x(i,j-1) + x(i+1,j-1)
     *              + x(i-1,j)   + x(i,j)   + x(i+1,j)
     *              + x(i-1,j+1) + x(i,j+1) + x(i+1,j+1)
         enddo
      enddo

In this example, we are reading a 9-point stencil in the ij plane centered at x(i,j). If the cache is large enough to hold 3 columns of the x array, then at least 8 out of 9 right-hand side references will hit in the cache after the first iteration on j; i.e., a cache miss will occur in at most the reference to x(i+1,j+1) for each value of i.

Finite difference computations tend to exhibit temporal locality in two or three dimensions; this locality is harder for a compiler to exploit, since it may occur in adjoining loops, across outer loop iterations or even across subroutine calls. Currently, vector systems do not exploit temporal locality at all, whereas caches allow operand reuse across iterations and therefore do very well on such loops. Registers are another source of potential reuse.

Generally, loops with non-stride-1 or gather/scatter references have no locality. Let us assume in the following example that b has more than one row (i.e., j > 1 is a valid row index), and recall that Fortran arrays are stored in column major order. Then the following loop has a non-stride-1 reference on the row access to matrix b and a gather reference on vector a:

      do i = 1,n
         b(j,i) = a(index(i))
      enddo

For this kind of loop, vector systems are far superior. The streaming memory and E-registers on the CRAY T3E would also handle this loop well. Cache-based systems, however, will bring in from memory, with each cache line, extra operands that do not get used.

4. Application quadrants and typical kernels

How does one determine which SGI/CRAY product best fits a given customer’s needs? We set up four categories depending on two criteria:

(a) dominant application type, which can be “third-party”, as developed by independent software vendors, or “roll-your-own”, as developed by the customer’s own users and/or staff, and

(b) dominant purchasing constraint, which ranges from considering price/performance ratios to absolute performance.

Why distinguish between “roll-your-own” and third-party applications, proposed as criterion (a) above? By definition, if applications required rewriting in order to use E-registers, vectorize the code, or use distributed memory, “roll-your-own” users would be willing to effect such modifications. Third-party software vendors, on the other hand, are usually reluctant to make changes to code for performance reasons that are not generally applicable. For instance, most of the changes required to optimize a code for an Origin 2000 would consist of increasing temporal locality and hence would be reasonably portable to other scalar, cache-based architectures. Still, since vectorization depends only on inner-loop analysis and vectorizing compilers have improved significantly over the last 20 years or so, many third-party applications vectorize and run well on CRAY T90 and CRAY J90 systems.

As for the classification of the market into a segment for which the price/performance ratio is important and a segment for which raw performance is paramount, Table 2 puts the price differences in perspective.

Figures in Table 2 are approximate; total system cost can vary greatly depending on memory size, peripherals, and software.

Fig. 1 shows the four products distributed among the four quadrants defined by the two criteria.


Table 2

System           Approx. price per CPU
CRAY T90         $800,000
CRAY J90         $30,000
CRAY T3E-1200    $30,000
Origin 2000      $20,000

Fig. 1. Product positioning.

The CRAY J90 and the CRAY T90 have very similar architectures, but are built from different technologies. The CRAY J90 always has a better price/performance ratio than the CRAY T90 and will run any problem that the CRAY T90 will run, but the CRAY T90 will run it 5–7 times faster.

The CRAY T3E generally has an excellent price/performance ratio. However, for small systems with few processors, the Origin 2000 has a better price/performance ratio. The Origin 2000 also sports a more flexible design in that it allows for shared-memory-style programming, which is known to be an easier parallel programming model than message passing [12]. For these reasons, the CRAY T3E is not included in quadrants 1 and 3 of Fig. 1. Furthermore, few third-party applications exist for the CRAY T3E as of this writing; hence, the CRAY T3E is not included in quadrant 2. However, the CRAY T3E is designed to scale to 2048 processors and has special hardware for the performance-hungry user, so it is firmly placed in quadrant 4.

The CRAY J90 and the Origin 2000 overlap substantially in their price ranges, so the question is now, which is more appropriate? To answer this question, we now classify applications on the basis of two attributes: (a) level of vectorization, and (b) temporal locality of memory accesses.


Fig. 2. Code classification and dominant system parameters.

Fig. 3. Typical CRAY J90 to Origin 2000 single-processor performance ratios by code type. Quadrants are the same as in Fig. 2.

Fig. 2 shows the resulting quadrants, with the lower right quadrant being further subdivided into four sections. Each quadrant and subquadrant is labeled by the system parameters that largely determine the performance.

Fig. 3 shows the same quadrants as Fig. 2, but we now have inserted typical CRAY J90 to Origin 2000 single-processor performance ratios on applications dominated by the attributes of the corresponding quadrants. The Origin 2000 clearly dominates on scalar codes that exhibit good locality of access (temporal and spatial), while the CRAY J90 excels at running codes that are highly vectorized and are dominated by non-local memory accesses, especially non-stride-1 and gather/scatter. In the remaining two quadrants these systems’ performances are within a factor of 2 of each other.

While the performance ratios in Fig. 3 were not arrived at through any sort of scientific study, they are the result of two years of experience benchmarking these systems with real customer applications.

Most codes have a mixture of the attributes in the various quadrants and subquadrants, but we have selected some examples that fit neatly into these categories.

In the remainder of the section we illustrate the preceding points with data from actual applications and benchmarks. Origin 2000 runs were carried out on a system with 195 MHz processors; CRAY T3E runs were performed on a system with 300 MHz processors.

4.1. LINPACK 1000 × 1000

LINPACK N × N is a classical benchmark [3] in which a large percentage of the time is spent in doing a matrix multiplication: O(N^3) operations are performed on O(N^2) data. The problem can be coded either for efficient use of vector capabilities or for extensive cache reuse. With caches that can now accommodate large fractions of the data and specially tuned libraries, the N = 1000 case runs well on virtually any system, typically at 90% or more of peak speeds. Table 3 shows single-processor results for current SGI/CRAY systems.

For N = 1000, LINPACK N × N belongs in quadrant 3 of Fig. 2.
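To make the cache-reuse point concrete, the sketch below is our own illustration (not the tuned library code actually used in the benchmark) of the blocked matrix multiplication that underlies such solvers; the matrix order n and the block size nb are hypothetical, with nb chosen so that three nb × nb tiles fit comfortably in a secondary cache:

      program blockmm
      implicit none
      integer, parameter :: n = 512, nb = 64    ! hypothetical sizes; n is a multiple of nb
      real(8), allocatable :: a(:,:), b(:,:), c(:,:)
      integer :: i, j, k, ii, jj, kk
      allocate(a(n,n), b(n,n), c(n,n))
      call random_number(a)
      call random_number(b)
      c = 0.0d0
      ! Each nb x nb tile of a and b is reused nb times while it is resident
      ! in cache, so O(n**3) operations touch only O(n**2) data in memory.
      do jj = 1, n, nb
         do kk = 1, n, nb
            do ii = 1, n, nb
               do j = jj, jj+nb-1
                  do k = kk, kk+nb-1
                     do i = ii, ii+nb-1
                        c(i,j) = c(i,j) + a(i,k)*b(k,j)
                     enddo
                  enddo
               enddo
            enddo
         enddo
      enddo
      print *, c(1,1)
      end program blockmm

The innermost loop is stride-1 over both a and c, so the same kernel also vectorizes; blocking supplies the temporal locality that a cache-based system needs, while a vector system obtains its performance from the stride-1 inner loop alone.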

4.2. DoD benchmark NIKE3D

This benchmark resembles a real application more closely than LINPACK. The time is spent mostly in a dense linear equation solver, so it is largely a vectorizable code with good locality. Single-processor performances for the Origin 2000 and the vector systems are given in Table 4.

NIKE3D can also be classified as a quadrant-3 code. The code and test case were part of a benchmark for the U.S. Department of Defense High Performance Computing Modernization Program (http://www.hpcmo.hpc.mil/).

4.3. STREAM benchmark

The STREAM benchmark [7] is a set of four kernels intended to measure bandwidth and performance in processing very long vectors. Its stride-1 memory accesses have good spatial locality but no temporal locality. Consequently, it has little cache reuse. STREAM is thus a code from subquadrant 4a of Fig. 2.

Table 3

System Mflop/s

CRAY T90 1603

CRAY J90 191

T3E 380

Origin 2000 344

Table 4

System Mflop/s

CRAY T90 940

CRAY J90 160

Origin 2000 200


Table 5

System MB/s Mflop/s

CRAY T90 359,270 29,939

CRAY J90 18,870 1,572

CRAY T3E 17,148 1,429

Origin 2000 6,539 545

Table 6

System CPU time (s) Mflop/s

CRAY T90 70 487

CRAY J90 560 61

Origin 2000 935 37

Table 7

System CPU time (s) Mflop/s

CRAY T90 1926 35

CRAY J90 8960 7.5

Origin 2000 1083 62

We measured the “triad” kernel of STREAM, which is a = b + s · c, for vectors a, b, and c and scalar s. Table 5 shows 32-processor memory and compute performances obtained by running on each of the SGI/CRAY systems.
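A minimal single-processor sketch of the triad kernel follows (our rendition of the benchmark’s inner loop, with an arbitrary array length and our own timing and reporting, not the published STREAM source):

      program triad
      implicit none
      integer, parameter :: n = 5000000         ! arbitrary length, large enough to defeat caches
      real(8), allocatable :: a(:), b(:), c(:)
      real(8) :: s, t0, t1
      integer :: i
      allocate(a(n), b(n), c(n))
      b = 1.0d0; c = 2.0d0; s = 3.0d0
      call cpu_time(t0)
      do i = 1, n
         a(i) = b(i) + s*c(i)                   ! 2 flops and 24 bytes of memory traffic per iteration
      enddo
      call cpu_time(t1)
      print *, 'MB/s    =', 24.0d0*n/(t1-t0)/1.0d6
      print *, 'Mflop/s =',  2.0d0*n/(t1-t0)/1.0d6
      end program triad

Every element is touched exactly once, so there is nothing for a cache to reuse; the measured rate is set almost entirely by memory bandwidth.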

This is one of the few benchmarks where the CRAY T90 has a superior price/performance ratio compared to the Origin 2000.

4.4. Los Alamos NEUT

NEUT is a Monte-Carlo neutron transport code from Los Alamos National Laboratory [13]. It is characterized by its extensive use of long vectors and gather/scatter operations. Since the code is fully vectorized and has very little temporal locality, the benchmark runs faster on vector systems than on microprocessor-based systems, as the single-processor execution times in Table 6 show.

NEUT is a code representative of subquadrant 4d of Fig. 2.

4.5. GAMESS

GAMESS (General Atomic and Molecular Electronic Structure System) is an ab initio quantum chemistry package [9]. It vectorizes quite poorly but exhibits fairly good temporal locality. It is a very good application for cache-based systems. The single-processor performances shown in Table 7 clearly place this code in quadrant 1 of Fig. 2. The test case used was defined by the U.S. DoD High Performance Computing Modernization Program.

An Origin 2000 processor is nearly 80% faster than a much more expensive CRAY T90 processor. It is also more than eight times faster than a CRAY J90 processor.

5. Summary and conclusions

The technological pros and cons of each of the current SGI/CRAY high-performance product families can be summarized as follows:

(1) CRAY T90, CRAY J90: Vector architecture.
Pros: Superior technology for exploiting spatial locality where little or no temporal locality exists. Long latencies can be covered with a single vector instruction. Generally, inner loops written in a dependency-free style can vectorize automatically.
Cons: Temporal locality not exploited except for a tiny scalar-only data cache. Sometimes code modification is required to get vectorization.

(2) CRAY T3E: Cache, stream buffers, E-registers.
Pros: Variety of hardware mechanisms to handle a variety of situations. Cache handles small-scale temporal locality. Stream buffers handle arbitrarily large scale spatial locality. E-registers handle strided reference and gather/scatter type constructs that offer poor temporal and spatial locality.
Cons: Only six simultaneous streams are allowed, so compiler or programmer intervention may be required. E-register use requires programmer intervention via directives or libraries. Both of the cons, however, have to do with exploiting the hardware features with an “as-is” code, so they are not a problem for the performance-hungry roll-your-own user.

(3) Origin 2000: Large 4 Mbyte data cache.
Pros: Superior technology for exploiting temporal locality. Wide cache lines exploit spatial locality in hardware. Generally an “automatic” technology, with the main optimization consisting of blocked algorithms which are portable across a wide variety of cache-based architectures, resulting in a wider selection of third-party codes.
Cons: Does not handle constructs with little spatial or temporal locality well.

In conclusion, the following points can be made regarding the merits of each product family:

(1) CRAY T90. The CRAY T90 family is best suited for those users for whom absolute performance is the critical buying factor. Customers for the CRAY T90 usually are not overly concerned with price/performance ratios because the truly high CPU speed, memory bandwidth and I/O performance provide enough return on investment that the machine pays for itself. It has also been said that the value of the CRAY T90 system can be measured not in terms of Mflop/s, but in terms of the number of engineering decisions made per day [10].

(2) CRAY J90. When both low price and price/performance are overriding criteria, this is the system of choice for applications that are highly vectorized and require high memory bandwidth. Additional advantages include reliability (hardware and software mean time between failures is nearly a year), a robust operating system with many “data center” features, and a CPU upgrade that is expected to become available by the end of 1998, which will improve the CRAY J90’s sustained performance by a factor between two and four, depending on the application [8].


(3) CRAY T3E. There are very few third-party applications for the CRAY T3E, and therefore this system is found mainly in the “roll-your-own” customer space, e.g., universities, research and weather centers; the average number of processors for currently installed systems exceeds 170. The programming model is message passing, and its Cray-developed SHMEM library [1] is currently the world’s fastest and easiest to use communications library. The other commonly-used message-passing libraries, PVM [4] and MPI [5], are, of course, also supported on the CRAY T3E.

(4) Origin 2000. This family is based on a microprocessor having extremely low internal latencies that result in excellent scalar performance, especially when the large secondary cache is appropriately used. For parallel applications, the support of a shared-memory programming model coupled with a very large address space allows ease of development. The shared-memory model also supports highly efficient implementations of the message-passing libraries PVM, MPI and SHMEM. For users of third-party software, the catalog of supported applications includes a few thousand titles.

Acknowledgements

The authors thank Richard Sandness, Enrique López-Pineda, and Michael Y. Sheh for helpful comments and suggestions, and an anonymous referee for many remarks that improved the presentation.

References

[1] Cray Research, Inc., SHMEM Technical Note SN-2516 2.3, Eagan, MN (1994).
[2] Cray Research, Inc., Optimizing code on Cray PVP systems, SG-2192 2.0, Eagan, MN (1995).
[3] J.J. Dongarra, Performance of various computers using standard linear equations software, Technical Report CS-89-85, University of Tennessee (1997).
[4] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam, PVM 3 user's guide and reference manual, Technical Report ORNL/TM-12187, Oak Ridge National Laboratory (1994).
[5] W. Gropp, E. Lusk and A. Skjellum, Using MPI (MIT Press, Cambridge, MA, 1994).
[6] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach (Morgan Kaufmann, San Francisco, CA, 2nd ed., 1996).
[7] J. McCalpin, Sustainable memory bandwidth in current high performance computers, http://reality.sgi.com/mccalpin/papers/bandwidth, Silicon Graphics, Inc., Mountain View, CA (1995).
[8] R.A. Sandness, Private communication.
[9] M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis and J.A. Montgomery, General atomic and molecular electronic structure system, J. Comput. Chem. 14 (1993) 1347–1363.
[10] M.Y. Sheh, Private communication.
[11] D.J. Theis, Special tutorial: Vector supercomputers, Computer 7 (4) (1974) 52–61.
[12] S.P. VanderWiel, D. Nathanson and D.J. Lilja, Complexity and performance in parallel programming languages, in: Proc. International Workshop on High-Level Parallel Programming Models and Supportive Environments (April 1997) 3–12.
[13] H.J. Wasserman and M.L. Simmons, Benchmark tests on the Cray Research, Inc., CRAY J90, Technical Report LA-UR-95-4111, Los Alamos National Laboratory (1995); also http://www.c3.lanl.gov/~hjw/Web_papers/pubs.html.