



Jack Dongarra

A Historical Overview and Examination of Future Developments

IEEE Circuits & Devices Magazine, January/February 2006, pp. 22–27.


In this article, we analyze major recent trends and changes in the high-performance computing (HPC) marketplace. Within the first half of this decade, clusters of PCs and workstations have become the prevalent architecture for many HPC application areas in all ranges of performance. However, the Earth Simulator vector system demonstrated that many scientific applications could benefit greatly from other computer architectures. At the same time, there is renewed broad interest in the scientific HPC community in new hardware architectures and new programming paradigms. The IBM BlueGene/L system is one early example of a shifting design focus for large-scale systems. The DARPA HPCS program has the declared goal of building a PetaFlops computer system by the end of the decade using novel computer architectures.

BACKGROUND

Over the last 50 years, the field of scientific computing has undergone rapid change—we have experienced a remarkable turnover of technologies, architectures, vendors, and the usage of systems. Despite all these changes, the long-term evolution of performance seems to be steady and continuous, following Moore's Law rather closely. In 1965, Gordon Moore, one of the founders of Intel, conjectured that the number of transistors per square inch on integrated circuits would roughly double every year. It turns out that the frequency of doubling is not 12 months, but roughly 18 months [8]. Moore predicted that this trend would continue for the foreseeable future. In Figure 1, we plot the peak performance over the last five decades of computers that have been called supercomputers.


A broad definition for a supercomputer is that it is one of the fastest computers currently available. Supercomputers are systems that provide significantly greater sustained performance than is available from mainstream computer systems. The value of supercomputers derives from the value of the problems they solve, not from the innovative technology they showcase. By performance, we mean the rate of execution of floating-point operations. Here we chart kiloflops (thousands of floating-point operations) per second (KFlop/s), megaflops (millions) per second (MFlop/s), gigaflops (billions) per second (GFlop/s), teraflops (trillions) per second (TFlop/s), and petaflops (1,000 trillions) per second (PFlop/s). Figure 1 shows clearly how well Moore's Law has held over almost the complete lifespan of modern computing—we see an increase in performance averaging two orders of magnitude every decade.
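
As a quick consistency check (our own back-of-the-envelope arithmetic, not part of the original article), an 18-month doubling time compounds over a decade to

    2^{120/18} ≈ 2^{6.7} ≈ 102 ≈ 10^2,

i.e., roughly two orders of magnitude, which matches the trend visible in Figure 1.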

In the second half of the 1970s, the introduction of vector computer systems marked the beginning of modern supercomputing. A vector computer or vector processor is a machine designed to efficiently handle arithmetic operations on elements of arrays, called vectors. These systems offered a performance advantage of at least one order of magnitude over conventional systems of that time. Raw performance was the main, if not the only, selling point for supercomputers of this variety. However, in the first half of the 1980s, the integration of vector systems into conventional computing environments became more important. Only those manufacturers that provided standard programming environments, operating systems, and key applications were successful in winning the industrial customers that became essential for survival in the marketplace. Performance was increased primarily by improved chip technologies and by producing shared-memory multiprocessor systems, sometimes referred to as symmetric multiprocessors (SMPs). An SMP is a computer system that has two or more processors connected in the same cabinet, managed by one operating system, sharing the same memory, and having equal access to input/output devices. Application programs may run on any or all processors in the system; the assignment of tasks is decided by the operating system. One advantage of SMP systems is scalability; additional processors can be added as needed, up to some limit determined by the rate at which data can be sent to and from memory.

Fostered by several government programs, scalable parallel computing using distributed memory became a focus of interest at the end of the 1980s. A distributed-memory computer system is one in which several interconnected computers share the computing tasks assigned to the system. Overcoming the hardware scalability limitations of shared memory was the main goal of these new systems. The increase in performance of standard microprocessors after the reduced instruction set computer (RISC) revolution, together with the cost advantage of large-scale parallelism, formed the basis for what has been called the "attack of the killer micros" [1]. The consequence was the transition from emitter-coupled logic (ECL) to complementary metal-oxide semiconductor (CMOS) chip technology and the use of off-the-shelf commodity microprocessors instead of custom processors for massively parallel processors (MPPs). The strict definition of an MPP is a machine with many interconnected processors, where "many" depends on the state of the art. Currently, the majority of high-end machines have fewer than 256 processors, with the largest having on the order of 10,000 processors. A more practical definition of an MPP is a machine whose architecture is capable of having many processors—i.e., it is scalable. In particular, machines with a distributed-memory design (in comparison with shared-memory designs) are usually synonymous with MPPs, since they are not limited to a certain number of processors. In this sense, "many" is a number larger than the current largest number of processors in a shared-memory machine.

STATE OF SYSTEMS TODAY

The acceptance of MPP systems not only for engineering applications but also for new commercial applications, especially database applications, emphasized different criteria for market success, such as the stability of the system, the continuity of the manufacturer, and the price/performance ratio. Success in commercial environments is now an important new requirement for a successful supercomputer business. Due to these factors and the consolidation in the number of vendors in the market, hierarchical systems built with components designed for the broader commercial market are currently replacing homogeneous systems at the very high end of performance. Clusters built with off-the-shelf components are also gaining more and more attention. A cluster is a commonly found computing environment consisting of many PCs or workstations connected together by a local-area network. PCs and workstations, which have become increasingly powerful over the years, can together be viewed as a significant computing resource. This resource is commonly known as a cluster of PCs or workstations and can be generalized to a heterogeneous collection of machines with arbitrary architectures.

At the beginning of the 1990s, while multiprocessor vector systems reached their widest distribution, a new generation of MPP systems came on the market, claiming to equal or even surpass the performance of vector multiprocessors. To provide a more reliable basis for statistics on high-performance computers, the Top500 [4] list was started. This report lists the sites that have the 500 most powerful installed computer systems. The best LINPACK benchmark performance [9] achieved is used as a performance measure to rank the computers. The Top500 list has been updated twice a year since June 1993 (Figure 2). In the first Top500 list in June 1993, there were already 156 MPP and single-instruction multiple-data (SIMD) systems present (31% of the total 500 systems).

The year 1995 saw remarkable changes in the distribution of the systems in the Top500 according to customer type (academic sites, research labs, industrial/commercial users, vendor installations, and confidential sites). Until June 1995, the trend in the Top500 data was a steady decrease in industrial customers, matched by an increase in the number of government-funded research sites. This trend reflects the influence of governmental HPC programs that made it possible for research sites to buy parallel systems, especially systems with distributed memory.



The industry was understandably reluctant to follow this path, since systems with distributed memory have often been far from mature or stable. Hence, industrial customers stayed with their older vector systems, which gradually dropped off the Top500 list because of low performance.

Beginning in 1994, however, companies such as SGI, Digital, and Sun began selling SMP models in their workstation families. From the very beginning, these systems were popular with industrial customers because of the maturity of the architecture and their superior price/performance ratio. At the same time, IBM SP systems began to appear at a reasonable number of industrial sites. While the IBM SP was initially intended for numerically intensive applications, in the second half of 1995 the system began selling successfully in the larger commercial market, with dedicated database systems representing a particularly important component of sales.

It is instructive to compare the growth rates of the performance of machines at fixed positions in the Top500 list with those predicted by Moore's Law. To make this comparison, we separate the influence of increasing processor performance from that of the increasing number of processors per system on the total accumulated performance. (To get meaningful numbers, we exclude the SIMD systems from this analysis, as they tend to have extremely high processor counts and extremely low per-processor performance.) In Figure 3, we plot the relative growth of the total number of processors and of the average processor performance, defined as the ratio of total accumulated performance to the number of processors. We find that these two factors contribute almost equally to the annual total performance growth—a factor of 1.82. On average, the number of processors grows by a factor of 1.30 each year and the processor performance by a factor of 1.40 per year, compared to the factor of 1.58 predicted by Moore's Law.
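
As a quick arithmetic check (ours, not part of the original analysis): the two per-year factors combine as 1.30 × 1.40 ≈ 1.82, the total annual growth quoted above, while an 18-month doubling time corresponds to an annual factor of 2^{12/18} = 2^{2/3} ≈ 1.58, the Moore's Law figure used for comparison.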

PROGRAMMING MODELS

The standard parallel architectures support a variety of decomposition strategies, such as decomposition by task (task parallelism) and decomposition by data (data parallelism). Data parallelism is the most common strategy for scientific programs on parallel machines. In data parallelism, the application is decomposed by subdividing the data space over which it operates and assigning different processors to the work associated with different data subspaces. Typically, this strategy involves some data sharing at the boundaries, and the programmer is responsible for ensuring that this data sharing is handled correctly, i.e., that data computed by one processor and used by another is correctly synchronized.

Once a specific decomposition strategy is chosen, it must be implemented. Here, the programmer must choose the programming model to use. The two most common models are:

✦ the shared-memory model, in which it is assumed that all data structures are allocated in a common space that is accessible from every processor

✦ the message-passing model, in which each processor (or process) is assumed to have its own private data space and data must be explicitly moved between spaces as needed.

In the message-passing model, data is distributed across the processor memories; if a processor needs to use data that is not stored locally, the processor that owns that data must explicitly send the data to the processor that needs it. The latter must execute an explicit receive operation, which is synchronized with the send, before it can use the communicated data.
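
The following minimal sketch (ours, not from the article) shows this explicit send/receive pairing using MPI [7]; rank 0 owns a value that rank 1 needs, and rank 1 cannot use it until its matching receive completes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;  /* data owned by rank 0 */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocks until the matching send arrives; only then is
           'value' safe to use on this process. */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}

The program must be launched with at least two processes (e.g., mpirun -np 2 ./a.out); with fewer, rank 1 does not exist and the send to it would be invalid.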

To achieve high performance on parallel machines, the programmer must be concerned with scalability and load balance. Generally, an application is thought to be scalable if larger parallel configurations can solve proportionally larger problems in the same running time as smaller problems on smaller configurations. Load balance typically means that the processors have roughly the same amount of work, so that no one processor holds up the entire solution. To balance the computational load on a machine with processors of equal power, the programmer must divide the work and the communications evenly. This can be challenging in applications applied to problems whose size is unknown until run time.
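
As a small illustration of even work division (our sketch; block_range is a hypothetical helper, not from the article), a block distribution hands out n items among p processors so that no two processors differ by more than one item, even when n is only known at run time and is not a multiple of p:

/* Processor 'rank' (0 <= rank < p) handles items [*lo, *hi) of n items.
   The first n % p ranks receive one extra item each. */
void block_range(long n, int p, int rank, long *lo, long *hi)
{
    long base  = n / p;   /* items every rank gets     */
    long extra = n % p;   /* leftover items to spread  */
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

For example, with n = 10 and p = 4 the ranks receive the ranges [0,3), [3,6), [6,8), and [8,10).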

FUTURE TRENDS

Based on the current Top500 data (which cover the last 13 years) and the assumption that the current rate of performance improvement will continue for some time to come, we can extrapolate the observed performance and compare these values with the goals of government programs such as the U.S. Department of Energy's (DOE's) Accelerated Strategic Computing (ASC) program, the High Performance Computing and Communications program, and the PetaOps initiative.


1. Moore's law and peak performance of various computers over time. (The plotted systems run from the EDSAC 1 and UNIVAC 1 near 1 KFlop/s, through the IBM 7090, CDC 6600, CDC 7600, IBM 360/195, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, and Earth Simulator, to the IBM BlueGene/L approaching 1 PFlop/s.)


In Figure 4, we extrapolate the observed performance using linear regression on a logarithmic scale. This means that we fit exponential growth to all levels of performance in the Top500. This simple curve fit of the data shows surprisingly consistent results.

Looking even further into the future, we speculate that, based on the current doubling of performance every 12 to 14 months, the first PetaFlop/s system should be available around 2009. Due to the rapid changes in the technologies used in HPC systems, no reasonable projection is currently possible for the architecture of the PetaFlop/s systems at the end of the decade. Even though the HPC market has changed substantially since the introduction of the Cray 1 three decades ago, there is no end in sight to these rapid cycles of architectural redefinition.
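
The curve fit behind this projection is ordinary least squares applied to log10(performance) versus time. The sketch below is ours, not the article's code: it uses only the two N=1 endpoints that can be read off Figure 3 as illustrative input (so the "fit" is simply the line through them), whereas Figure 4 fits every Top500 edition and list position.

#include <math.h>
#include <stdio.h>

/* Least-squares fit of log10(gflops) = a + b * year. */
static void fit_log_linear(const double *year, const double *gflops,
                           int n, double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        double x = year[i], y = log10(gflops[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *a = (sy - *b * sx) / n;
}

int main(void)
{
    /* Illustrative N=1 endpoints read off Figure 3, in GFlop/s. */
    double year[] = { 1993.0, 2005.0 };
    double perf[] = { 59.7, 280000.0 };
    double a, b;

    fit_log_linear(year, perf, 2, &a, &b);
    printf("annual growth factor of the N=1 entry: %.2f\n", pow(10.0, b));
    printf("doubling time: %.1f months\n", 12.0 * log10(2.0) / b);
    return 0;
}

For these two endpoints, the fitted doubling time comes out close to 12 months, at the fast end of the range quoted above.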

There are two general conclusions we can draw from these figures. First, parallel computing is here to stay; it is the primary mechanism by which computer performance can keep up with the predictions of Moore's Law in the face of the increasing influence of performance bottlenecks in conventional processors. Second, the architecture of high-performance computing will continue to evolve at a rapid rate; thus, it will be increasingly important to find ways to support scalable parallel programming without sacrificing portability. This challenge must be met by the development of software systems and algorithms that promote portability while easing the burden of program design and implementation.

Grid Computing

Grid computing provides for a virtualization of distributed computing and data resources, such as processing, network bandwidth, and storage capacity, to create a single system image, giving users and applications seamless access to the collective resources. Just as an Internet user views a unified instance of content via the Web, a grid user essentially sees a single, large virtual computer.

Grid technologies promise to change the way organizations tackle complex computational problems. However, the vision of large-scale resource sharing is not yet a reality in many areas—grid computing is an evolving area of computing, where standards and technology are still being developed to enable this new paradigm.

Early efforts in grid computing started as projects to link U.S. supercomputing sites, but the field has now grown far beyond its original intent. In fact, there are many applications that can benefit from the grid infrastructure, including collaborative engineering, data exploration, high-throughput computing, and, of course, distributed supercomputing.

Ian Foster [12] defines a grid as a system that

“coordinates resources that are not subject to centralized control . . .

(A grid integrates and coordinates resources and users that live within different control domains, for example, the user's desktop versus central computing; different administrative units of the same company; or different companies; and addresses the issues of security, policy, payment, and membership that arise in these settings. Otherwise, we are dealing with a local management system.)

“. . . using standard, open, general-purpose protocols and interfaces . . .”

(A grid is built from multipurpose protocols and interfaces that address such fundamental issues as authentication, authorization, resource discovery, and resource access. As discussed further later in this article, it is important that these protocols and interfaces be standard and open. Otherwise, we are dealing with an application-specific system.)

“… to deliver nontrivial qualities of service.”

(A grid allows its constituent resources to be used in a coordinated fashion to deliver various qualities of service, relating, for example, to response time, throughput, availability, and security, and/or the coallocation of multiple resource types to meet complex user demands, so that the utility of the combined system is significantly greater than that of the sum of its parts.)

At its core, grid computing is based on an open set of standards and protocols, e.g., the open grid services architecture (OGSA), which enable communication across heterogeneous, geographically dispersed environments. With grid computing, organizations can optimize computing and data resources, pool them for large-capacity workloads, share them across networks, and enable collaboration.


2. Processor architectures (scalar, vector, SIMD, and others) of the systems in the Top500, 1993–2005. (Source: http://www.top500.org/)


A number of challenges remain to be understood and overcome in order for grid computing to achieve widespread adoption. The major obstacle is the need for seamless integration over heterogeneous resources to accommodate the wide variety of applications requiring such resources.

TRANSFORMING EFFECT ON SCIENCE AND ENGINEERING

Supercomputers have transformed a number of science and engineering disciplines, including cosmology, environmental modeling, condensed matter physics, protein folding, quantum chromodynamics, device and semiconductor simulation, seismology, and turbulence. As an example, consider cosmology—the study of the universe, its evolution, and structure—where one of the most striking paradigm shifts has occurred. A number of new, tremendously detailed observations deep into the universe are available from instruments such as the Hubble Space Telescope and the Sloan Digital Sky Survey [2]. However, until recently it has been difficult, except in relatively simple circumstances, to tease from mathematical theories of the early universe enough information to allow comparison with observations.

Supercomputers have changed all of that. Now, cosmologists can simulate the principal physical processes at work in the early universe over space-time volumes sufficiently large to determine the large-scale structures predicted by the models. With such tools, some theories can be discarded as being incompatible with the observations. Supercomputing has allowed the comparison of theory with observation, thus transforming the practice of cosmology.

Another example is the DOE's ASC program, which applies advanced capabilities in scientific and engineering computing to one of the most complex challenges in the nuclear era—maintaining the performance, safety, and reliability of the Nation's nuclear weapons without physical testing. A critical component of the agency's Stockpile Stewardship Program (SSP), ASC research develops computational and simulation technologies to help scientists understand aging weapons, predict when components will have to be replaced, and evaluate the implications of changes in materials and fabrication processes for the design life of aging weapons systems. The ASC program was established in 1996 in response to the administration's commitment to pursue a comprehensive ban on nuclear weapons testing. ASC researchers are developing high-end computing capabilities far above the current level of performance, as well as advanced simulation applications that can reduce the current reliance on empirical judgments by achieving higher resolution, higher fidelity, 3-D physics, and full-system modeling capabilities for assessing the state of nuclear weapons.

Parallelism is a primary method for accelerating the total power of a supercomputer. That is, in addition to continuing to improve the performance of a technology, multiple copies are deployed, providing some, but not all, of the advantages of an improvement in raw performance.

Employing parallelism to solve large-scale problems is not without its price. The complexity of building parallel supercomputers with thousands of processors to solve real-world problems requires a hierarchical approach—associating memory closely with central processing units (CPUs). Consequently, the central problem faced by parallel applications is managing a complex memory hierarchy, ranging from local registers to far-distant processor memories. It is the communication of data and the coordination of processes within this hierarchy that represent the principal hurdles to effective, correct, and widespread acceptance of parallel computing.


3. Performance growth at fixed Top500 rankings, 1993–2005. The N=1, N=500, and total (SUM) curves grow from 59.7 GFlop/s, 0.4 GFlop/s, and 1.167 TFlop/s in 1993 to 280 TFlop/s, 1.65 TFlop/s, and 2.30 PFlop/s in 2005; systems holding the N=1 position include the Fujitsu "NWT" (NAL), Intel ASCI Red (Sandia), IBM ASCI White, NEC Earth Simulator, and IBM BlueGene/L.


Thus, today's parallel computing environment has architectural complexity layered upon a multiplicity of processors. Scalability, the ability of hardware and software to maintain reasonable efficiency as the number of processors is increased, is the key metric.

The future will be more complex yet. Distinct computer systems will be networked together into the most powerful systems on the planet. The pieces of this composite whole will be distinct in hardware (e.g., CPUs), software (e.g., operating systems), and operational policy (e.g., security). This future is most apparent when we consider geographically distributed computing on the computational grid [10]. There is great emerging interest in using the global information infrastructure as a computing platform. By drawing on the power of geographically distributed HPC resources, it will be possible to solve problems that cannot currently be attacked by any single computing system, parallel or otherwise.

Computational physics applications have been the primary drivers in the development of parallel computing over the last 20 years. This set of problems has a number of features in common, despite substantial differences in their specific problem domains.

✦ Applications were often defined by a set of partial differential equations (PDEs) on some domain in space and time.

✦ Multiphysics often took the form of distinct physical domains with different processes dominant in each.

✦ The life cycle of many applications was essentially contained within the computer room, building, or campus.

These characteristics focused attention on discretizations of PDEs, the corresponding notion of resolution = accuracy, and solutions of the linear and nonlinear equations generated by these discretizations. Data parallelism and domain decomposition provided an effective programming model and a ready source of parallelism. Multiphysics, for the most part, was also amenable to domain decomposition and could be accomplished by understanding and trading information about the fluxes between the physical domains. Finally, attention was focused on the parallel computer, its speed and accuracy, and relatively little attention was paid to input/output (I/O) beyond the confines of the computer room.
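
To make the domain-decomposition pattern concrete, the sketch below is ours (it assumes a one-dimensional decomposition with one ghost cell on each side; halo_exchange is a hypothetical helper name). Each step of a PDE stencil code would call it to trade boundary values with the left and right neighbors before updating the locally owned cells; MPI_PROC_NULL switches the exchange off at the ends of the global domain:

#include <mpi.h>

/* u[0] and u[n+1] are ghost cells; u[1..n] is the locally owned
   subdomain. MPI_Sendrecv pairs each send with a receive so the
   exchange cannot deadlock. */
void halo_exchange(double *u, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send leftmost owned value left; receive right ghost cell. */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send rightmost owned value right; receive left ghost cell. */
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}

After the exchange, each processor can apply its stencil to u[1..n] using only local data, which is exactly the boundary data sharing described earlier for data-parallel codes.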

The Holy Grail for software is portable performance. That is, software should be reusable across different platforms and provide significant performance, for example, relative to peak speed, for the end user. Often, these two goals seem to be in opposition to each other. Languages (e.g., Fortran and C) and libraries (e.g., the message passing interface (MPI) [7] and linear algebra libraries such as LAPACK [3]) allow the programmer to access or expose parallelism in a variety of standard ways. By employing standards-based, optimized libraries, the programmer can sometimes achieve both portability and high performance. Tools (e.g., SvPablo [11] and the Performance Application Programming Interface (PAPI) [6]) allow programmers to determine the correctness and performance of their code and, if it falls short in some way, suggest various remedies.
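
As an illustration of the kind of measurement such tools provide, here is a minimal sketch of ours using PAPI's low-level counter interface [6]; the choice of the PAPI_FP_OPS preset event is our assumption, and whether it is available depends on the processor:

#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long flops = 0;

    /* Initialize PAPI and build an event set that counts
       floating-point operations. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&eventset) != PAPI_OK ||
        PAPI_add_event(eventset, PAPI_FP_OPS) != PAPI_OK)
        exit(1);

    PAPI_start(eventset);

    /* The kernel being measured would run here. */
    double s = 0.0;
    for (int i = 1; i <= 1000000; i++)
        s += 1.0 / (double)i;

    PAPI_stop(eventset, &flops);
    printf("floating-point operations counted: %lld (sum = %f)\n",
           flops, s);
    return 0;
}

Comparing such counts against the theoretical peak of the machine gives the performance relative to peak speed mentioned above.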

ACKNOWLEDGMENTS

This research was supported in part by the Applied Mathematical Sciences Research Program of the Office of Mathematical, Information, and Computational Sciences, U.S. Department of Energy, under contract DE-AC05-00OR22725 with UT-Battelle, LLC.

REFERENCES

[1] E. Brooks, "The attack of the killer micros," presented at the Teraflop Computing Panel, Supercomputing '89, Reno, Nevada, 1989.

[2] D.G. York et al., "The Sloan Digital Sky Survey: Technical summary," Astron. J., vol. 120, no. 3, pp. 1579–1587, Sept. 2000.

[3] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. Philadelphia, PA: SIAM, 1999.

[4] "Top500 Report" [Online]. Available: http://www.top500.org/

[5] J. Dongarra, K. London, S. Moore, P. Mucci, and D. Terpstra, "Using PAPI for hardware performance monitoring on Linux systems," in Proc. Conf. Linux Clusters: The HPC Revolution, 2001 [Online]. Available: http://www.linuxclusterinstitute.org/Linux-HPC-Revolution/Archive/2001techpapers.html

[6] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, "A portable programming interface for performance evaluation on modern processors," Int. J. High Perform. Comput. Appl., vol. 14, no. 3, pp. 189–204, 2000.

[7] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, MPI: The Complete Reference. Cambridge, MA: MIT Press, 1996.

[8] G.E. Moore, "Cramming more components onto integrated circuits," Electronics, vol. 38, no. 8, pp. 114–117, Apr. 19, 1965.

[9] J.J. Dongarra, "Performance of various computers using standard linear equations software," LINPACK Benchmark Rep., Univ. of Tennessee Computer Science Tech. Rep. CS-89-85, 2003 [Online]. Available: http://www.netlib.org/benchmark/performance.pdf

[10] I. Foster and C. Kesselman, Eds., Computational Grids: Blueprint for a New Computing Infrastructure. San Mateo, CA: Morgan Kaufmann, 1998.

[11] L. DeRose and D.A. Reed, "SvPablo: A multi-language architecture-independent performance analysis system," in Proc. Int. Conf. Parallel Processing (ICPP '99), Fukushima, Japan, Sept. 1999, pp. 311–318.

[12] I. Foster, "What is the grid? A three point checklist," GRIDtoday, vol. 1, no. 6, July 20, 2002 [Online]. Available: http://www.grid4today.com/02/0722/100136.html

Jack Dongarra is with the University of Tennessee and Oak Ridge National Laboratory. E-mail: [email protected].


4. Extrapolation of Top500 results. The N=1, N=500, and SUM curves are extended from 1993 through 2015 on a logarithmic scale running from 100 MFlop/s to 1 EFlop/s, with a marker for the DARPA HPCS program.