
Ethernet in the World’s Top500 Supercomputers (Updated June 2006)


White Paper

Introduction: The Top500 Supercomputing Elite

The world’s most powerful supercomputers continue to get faster. According to top500.org, which maintains the list of the 500 supercomputers with the highest Linpack performance, the aggregate performance of the listed computers has grown 21% in the last seven months and 65% in the last year. This growth rate is slower than for other recent lists, but it continues to compare favorably with the rate of improvement predicted by Moore’s Law (2x every 18 months).
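The comparison is easy to make concrete: doubling every 18 months corresponds to roughly 59% growth per year, so the list's 65% annual growth still runs slightly ahead of the Moore's Law pace, while the most recent seven-month rate would fall below it if sustained. A short illustrative calculation using only the figures quoted above:

```python
# Doubling every 18 months (Moore's Law, as cited above) expressed as an annual rate,
# compared with the Top500 aggregate growth figures quoted for the June 2006 list.
moore_annual = 2 ** (12 / 18) - 1          # ~0.587, i.e. about 59% per year
seven_month_rate = 0.21                    # growth over the last seven months
annualized_seven = (1 + seven_month_rate) ** (12 / 7) - 1   # ~39% if sustained for a year

print(f"Moore's Law pace:       {moore_annual:.0%} per year")
print(f"Top500 last 12 months:  65% (ahead of the Moore's Law pace)")
print(f"Top500 last 7 months:   {seven_month_rate:.0%} (~{annualized_seven:.0%} annualized)")
```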

The #1 supercomputer on the June 2006 list is still the BlueGene/L, whose performance is unchanged at 280.6 TeraFlops, while the #500 supercomputer comes in at 2.026 TeraFlops vs. 1.646 TeraFlops in November. The somewhat slower rate of performance improvement is notable throughout the list. At the top, seven of the Top 10 systems from the November 2005 list were able to maintain a Top 10 position; for the November 2005 list, only four systems from the June 2005 list held onto their Top 10 status. At the bottom of the new list, performance improvement caused 158 systems from November to be de-listed, compared to more than 200 systems that fell off the previous time.

All 500 listed supercomputers use architectures that employ large numbers of processors (from as many as 131,072 to as few as 40) to achieve very high levels of parallel performance. A typical modern supercomputer is based on a large number of networked compute nodes dedicated to parallel execution of applications, plus a number of I/O nodes that deal with external communications and with access to data storage resources.

Top500.org categorizes the supercomputers on its list in the following way:

Clusters: Parallel computer systems assembled from commercially available systems/servers and networking devices, with each compute or I/O node a complete system capable of standalone operation. The current list includes 364 cluster systems (up from 360 in 11/05), including the #6 and #7 systems in the Top 10.

Constellations: Clusters in which the number of processors in a multi-processor compute node (typically an n-way Symmetric Multi-Processing or SMP node) exceeds the number of compute nodes. There are 38 constellations listed in the Top500 (up from 36 in 11/05), with the highest performing system listed at #5. Not counting the #5 computer, the highest performing constellation is at #67.

Massively Parallel Processors (MPP): Parallel computer systems comprised in part of specialized, purpose-built nodes and/or networking systems that are not commercially available as separate components. MPPs include vector supercomputers, DM-MIMD (distributed memory, multiple instruction stream, multiple data stream), and SM-MIMD (shared memory, multiple instruction stream, multiple data stream) supercomputers available from HPC computer vendors such as IBM, Cray, SGI, and NEC. MPP systems account for 98 entries on the current list (down from 104 in 11/05), including #1 through #4 and three more of the Top 10.

For a more detailed discussion of supercomputer designs and topologies, see the following white papers on the Force10 Networks website:

Building Scalable, High Performance Cluster/Grid Networks: The Role of Ethernet (www.force10networks.com/applications/roe.asp)

Ethernet in High Performance Computing Clusters (www.force10networks.com/products/ethernetclustering.asp)

Among the major trends in recent Top500 lists is the emergence of highly scalable switched Gigabit Ethernet as the most widely deployed system interconnect (SI). The remainder of this document focuses on the key roles that Ethernet plays as a multi-purpose interconnect technology used by virtually all supercomputers on the Top500 list.


The Rise of Clusters and Ethernet Cluster Interconnect

Cluster systems have become the dominant category of supercomputer largely because of the unmatched price/performance ratio they offer. As shown by the top curve in Figure 1, the total number of clusters on the list continues to grow, with an increase of 26% in the last two years.

As clusters have become both more powerful and more cost-effective, they have helped to make High Performance Computing (HPC) considerably more accessible to corporations for speeding up numerous parallel applications in research, engineering, and data analysis. As shown by the middle curve in Figure 1, the number of Top500 clusters owned by industrial enterprises has grown by 53% over the last two years. The Top500 list places supercomputer owners in the following categories: industry, research, government, vendor, academic, and classified.

Over the last three years, the adoption of HPC clusters by industrial enterprises has spurred a 26% increase in the total number of industrial systems on the Top500, as shown by Figure 2. Clusters have now become the dominant computer architecture for the industrial component of the Top500. In June 2006, clusters account for over 88% of the 257 industrial supercomputers on the list.

The cost-effectiveness of clusters as a category of supercomputer is driven by three major factors:

1. Availability of high volume server products incorporating very high performance single and multi-core microprocessors, minimizing hardware costs. Enterprises can even build high performance clusters using the same models of server already being deployed in the data center for mainstream IT applications.


2. Linux is the cluster operating system of choice, minimizing software licensing costs. Linux is the operating system used by 367 supercomputers on the list, up from 334 one year ago. Most Linux systems on the list are clusters, although some MPPs, including the IBM BlueGenes and Cray XT3s, also run Linux. Because Linux is increasingly popular as an enterprise server operating system, no new expertise is required and applications can readily be migrated from conventional Linux servers to Linux clusters.

3. Gigabit Ethernet (GbE) is the most cost-effective networking system for cluster interconnect (inter-compute-node communication). GbE is particularly attractive to enterprises because it is already a familiar mainstream technology in data centers and campus LAN networks. In addition, most high end Linux servers come with integral GbE at no extra cost. As shown by the bottom curve on Figure 1, GbE is the cluster interconnect for 94% of the industrial enterprise clusters on the Top500 (212 out of 226). The top performing GbE cluster, at #27 on the list with a performance of 12.3 TeraFlops, is a Geoscience industry computer built by IBM. The system is a BladeCenter LS20 with 5,000 AMD Opteron processor cores.

Networking for Supercomputers

Regardless of whether the supercomputer architecture is a cluster, constellation, or MPP, the compute nodes that house the multiple processors must be supported by a network or multiple networks to provide the system connections for the following functions:

IPC Fabric: Also known simply as the "Interconnect", an essential aspect of multi-processor supercomputing is the interprocessor communications (IPC) that allow large numbers of processors/compute nodes to work in a parallel, yet coordinated, fashion. Depending on the application, the bandwidth and latency of transfers between processors can have a significant impact on overall performance. For example, processors may waste time in an idle state waiting to receive intermediary results from other processors. All compute nodes are connected to the IPC fabric. In some cases, I/O nodes are also connected to this fabric.

Management Fabric: A separate management fabric allows system nodes to be accessed for control, troubleshooting, and status/performance monitoring without competing with the IPC for bandwidth. In general, every compute node and I/O node is attached to the management fabric.

I/O Fabric: This fabric connects the supercomputer I/O nodes to the outside world, providing user access and connection to complementary computing resources over the campus LAN, WAN, or Internet.

Storage Fabric: A common practice is to attach file servers or other storage subsystems to I/O nodes. This isolates the compute nodes from the overhead of storage access. A separate storage fabric may be used to provide connectivity between I/O nodes and file servers. In a few cases, the compute nodes are attached to storage resources via a SAN, which then acts as the storage fabric.

Figure 1. Growth of cluster systems on the Top500

Figure 2. Growth of industry-owned systems on the Top500

Top500.org focuses most of its attention on the Interconnect (IPC) fabric because that is obviously the network connection that has the primary impact on system performance. Figure 3 shows the number of supercomputers on recent Top500 lists that use each type of IPC Interconnect fabric. The continued rapid growth of GbE to its current position as the #1 Interconnect fabric in the Top500 (51% of the 6/06 list) reflects the accelerated adoption of clusters discussed previously. Virtually all the GbE IPC fabric systems shown in the chart are clusters.

In Figure 3, the "Other" category includes vendor-specific Interconnect fabrics that computer vendors incorporate in their MPP products, such as the IBM and Cray 3D torus networks, IBM's SP and Federation networks, the NEC crossbar, and the SGI NumaLink. Myrinet and Quadrics are commercially available proprietary HPC switching systems that were both specifically designed to provide low latency IPC system interconnect.

Figure 3. Growth of Gigabit Ethernet vs. other IPC interconnects in the Top500

Each Figure 4 entry lists: processor count and system type; IPC interconnect fabric; control/management fabric; external network I/O; storage fabric; Linpack TeraFlops.

#1  LLNL BlueGene/L, IBM: 131,072 processors, DM-MIMD MPP; IPC: 3D Torus + Tree + Barrier; Control/Mgmt: FE on 65,536 compute nodes; External I/O: GbE on 1,024 I/O nodes; Storage: same as I/O; 280.6 TeraFlops

#2  IBM Watson Research Center BlueGene/W, IBM: 40,960 processors, DM-MIMD MPP; IPC: 3D Torus + Tree + Barrier; Control/Mgmt: FE on 20,480 compute nodes; External I/O: GbE on 320 I/O nodes; Storage: same as I/O; 91.3 TeraFlops

#3  LLNL ASC Purple, IBM: 10,240 processors, MPP; IPC: Federation switch; Control/Mgmt: FE on 1,536 compute and I/O nodes; External I/O: 32 x 10 GbE + 8 GbE; Storage: FibreChannel; 75.8 TeraFlops

#4  NASA Columbia, SGI Altix: 10,160 processors, SM-MIMD MPP; IPC: NumaLink + 20 x IB + 40 x 10 GbE; Control/Mgmt: GbE on 20 Supernodes; External I/O: via 10 GbE fabric; Storage: via 10 GbE fabric; 51.9 TeraFlops

#5  CEA Tera-10, Bull: 8,704 processors, Cluster; IPC: Quadrics; Control/Mgmt: FE/GbE on 600 nodes; External I/O: GbE on 56 I/O nodes; Storage: FibreChannel; 42.9 TeraFlops

#6  Sandia National Labs Thunderbird, Dell: 8,000 processors, Cluster; IPC: InfiniBand; Control/Mgmt: same as I/O; External I/O: GbE on 4,600 nodes; Storage: same as I/O; 38.3 TeraFlops

#7  GSIC Centre Tsubame, NEC/Sun: 10,368 processors, Cluster; IPC: InfiniBand; Control/Mgmt: same as I/O; External I/O: 24 x GbE and 10 GbE via InfiniBand; Storage: InfiniBand; 38.2 TeraFlops

#8  FZJ / JUBL BlueGene, IBM: 16,384 processors, DM-MIMD MPP; IPC: 3D Torus + Tree + Barrier; Control/Mgmt: FE on 8,192 compute nodes; External I/O: GbE on 288 I/O nodes; Storage: same as I/O; 37.3 TeraFlops

#9  Sandia National Labs Red Storm, Cray XT3: 10,880 processors, DM-MIMD MPP; IPC: 3D Torus; Control/Mgmt: 2,582 x FE; External I/O: 40 x 10 GbE; Storage: 80 x 10 GbE; 36.2 TeraFlops

#10 Earth Simulator, NEC: 5,120 processors, Vector MPP; IPC: 640 x 640 crossbar; Control/Mgmt: GbE on 640 vector SC nodes; External I/O: GbE on 640 vector SC nodes; Storage: FibreChannel on 640 vector SC nodes; 35.9 TeraFlops

Figure 4. System interconnects of the Top 10 supercomputers (Top500 list)


InfiniBand is an industry standard, low latency, general purpose system interconnect, and it is the only IPC interconnect besides GbE that is capturing a growing share of the Top500 list. As Figure 3 indicates, the Top500 list, as a whole, is moving away from proprietary interconnect technologies.

On the June 2006 list, all but one of the Quadrics systems are listed as clusters, while Myrinet systems include 53 clusters and 33 constellations (i.e., nearly all the constellations in the Top500 use Myrinet as the IPC Interconnect fabric). The recent decline in the number of Myrinet systems on the list is due partly to the growth of Ethernet clusters and partly to the decline in the number of constellations that make the list (down to 38 in June 2006 from 70 in June 2005).

Networking for the Top 10 Supercomputers

Figure 4 provides an overview of the Top 10 systems on the June 2006 top500.org list with respect to the networking fabrics deployed in each of the functional areas described above.

As can be noted from the IPC fabric entries in Figure 4, all of the systems in the Top 10 use low latency IPC Interconnect fabrics to help achieve high performance. The MPP systems in the table rely on the computer vendors' proprietary interconnects, while the clusters use commercially available interconnects (InfiniBand or Quadrics). It is notable that commercially available proprietary interconnects are losing popularity in the Top 10 as well as throughout the list.

Figure 5 summarizes typical performance levels for these more specialized interconnects, as well as Gigabit Ethernet and 10 Gigabit Ethernet, in terms of Message Passing Interface (MPI) latency and bandwidth. There are a number of Ethernet NIC technologies now on the market (RDMA, TOE, kernel bypass, etc.) that reduce the host component of IP/Ethernet MPI latency. Therefore, MPI latency for Ethernet is expected to continue to decline toward a figure closer to the switch latency, on the order of 10 microseconds. The impact of TOE NICs is seen in the last row of the table. The TOE data comes from recent testing of 10 Gigabit Ethernet cluster interconnect by Chelsio Communications (www.chelsio.com), a leading Ethernet NIC supplier, and Los Alamos National Laboratory. The results demonstrate compelling performance levels compared with Myrinet and InfiniBand.

10 GbE switches and TOE NICs are now available at volume price levels, with additional improvements to come over the next couple of years. As these technologies ride further down the cost curve, Ethernet clusters should be able to continue to enhance their share of the Top500 by delivering ever-improving performance even without significant increases in processor counts.
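As a rough illustration of how MPI latency and bandwidth figures like those in Figure 5 are typically obtained, the sketch below runs a simple ping-pong test between two ranks: one-way latency is half the averaged round-trip time for a short message, and bandwidth is measured with a large message. It is a minimal example that assumes the mpi4py and NumPy packages and an MPI launcher; it is not the Chelsio/Los Alamos benchmark, and interpreted-language overhead means its absolute numbers will be higher than those in the table.

```python
# Minimal MPI ping-pong sketch. Run with two ranks, e.g. "mpirun -np 2 python pingpong.py".
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def one_way_time(nbytes, iters=1000):
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()                               # start both ranks together
    t0 = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=1)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=1)
    return (MPI.Wtime() - t0) / (2 * iters)      # half the average round-trip time

latency = one_way_time(8)                        # short message -> latency
big = 4 * 1024 * 1024
bandwidth = big / one_way_time(big)              # bytes per second, unidirectional

if rank == 0:
    print(f"latency   ~ {latency * 1e6:.1f} microseconds")
    print(f"bandwidth ~ {bandwidth / 1e6:.0f} MB/s")
```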

For a more detailed discussion of supercomputer designs and topologies, see the following white papers on the Force10 Networks website:

Building Scalable, High Performance Cluster/Grid Networks: The Role of Ethernet (www.force10networks.com/applications/roe.asp)

Ethernet in High Performance Computing Clusters (www.force10networks.com/products/ethernetclustering.asp)

As shown by the control/management and external I/O entries in Figure 4, all of the Top 10 supercomputers use switched Gigabit Ethernet or Fast Ethernet networks as the management fabric and general I/O fabric. Gigabit Ethernet is also the predominant fabric used to connect I/O nodes to file server resources. Therefore, although none of these Top 10 systems is categorized as a system with GbE interconnect, all of the systems make extensive use of Fast Ethernet, Gigabit Ethernet, and/or 10 Gigabit Ethernet for non-IPC system interconnect. A more complete description of the Top 10 supercomputers on the list is included toward the end of this document.


Figure 5. Latency and bandwidth of IPC interconnect fabrics

Technology              Vendor      MPI Latency (μsec)    MPI Bandwidth (MB/s)
NumaLink 4              SGI         1                     3,200
3D Torus                IBM         1.5                   175
QsNet II                Quadrics    2                     900
SeaStar 3D Torus        Cray        2                     6,000
InfiniBand 4X           Voltaire    3.5                   830
640 x 640 crossbar      NEC         n/a                   12,300
Myrinet XP2             Myricom     5.7                   495
Gigabit Ethernet        Various     30 (non-TOE)          100 (non-TOE)
10 Gigabit Ethernet     Various     10-14 (with TOE)      863 (with TOE)

(Latency is for a short message over a single hop; bandwidth is unidirectional.)

Source: IBM, NEC, Sandia, Chelsio, and SGI (www.sgi.com/products/servers/altix/numalink.html)


If the other 490 systems in the Top500 were examined as closely, we would expect to see that scalable Ethernet switching always plays an important system interconnect role in more than one of the four required functional areas.

Top500 Performance by System Type

Figure 6 is a column chart that shows the Linpack performance of all systems in the Top500, where the type of system is identified by the color of the column. Clusters are increasingly dominant between #100 and #500 on the list, accounting for 80% of the systems. In addition, clusters have made significant inroads among the Top 100 positions on the list. Clusters now occupy 45 positions in the Top 100, including 37 with low latency interconnects and 7 with GbE interconnect. If current trends continue, we can expect to see clusters become even more dominant for at least the next one or two list iterations.

Figure 6. Performance of Top500 computers by system architecture

Networking for Gigabit Ethernet Clusters

For supercomputer clusters that use GbE as the Interconnect fabric, switched Ethernet technology can be chosen to satisfy all of the system networking requirements. Figure 7 provides a conceptual example of how this may be done.

Highly scalable switches with 10 GbE and GbE ports are connected in a mesh forming a "fat tree" that serves as the IPC fabric connecting the compute nodes and the I/O nodes. Additional 10 GbE or GbE ports in the mesh of switches serve as the storage fabric that provides connectivity between the I/O nodes and file servers. Another set of switch ports and logical interfaces can play the role of an I/O fabric connecting the supercomputer I/O nodes to external resources and users.

A separate set of meshed Fast Ethernet switches can be used to construct an out-of-band management fabric. Fast Ethernet has more than adequate bandwidth for the management fabric function and is very inexpensive. Frequently the management fabric can be built using re-purposed high density Fast Ethernet switches previously used for server connectivity in data centers or in earlier generations of clusters.

Figure 7. Cluster using Ethernet for all four system interconnect fabrics
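To give a feel for the scale of such a fabric, the sketch below sizes a hypothetical two-tier Ethernet fat tree in which GbE ports face the compute and I/O nodes and 10 GbE ports form the mesh toward the core switches. The per-leaf port counts and uplink counts are illustrative assumptions, not figures from any particular Force10 design.

```python
# Rough sizing of a two-tier GbE/10 GbE fat tree (illustrative parameters only).
def size_fat_tree(nodes, gbe_ports_per_leaf=48, uplinks_per_leaf=4):
    leaves = -(-nodes // gbe_ports_per_leaf)          # ceiling division
    uplinks = leaves * uplinks_per_leaf               # 10 GbE links toward the core
    # oversubscription = downstream edge bandwidth / upstream uplink bandwidth
    oversub = (gbe_ports_per_leaf * 1.0) / (uplinks_per_leaf * 10.0)
    return leaves, uplinks, oversub

for nodes in (256, 1024, 4096):
    leaves, uplinks, oversub = size_fat_tree(nodes)
    print(f"{nodes:5d} nodes: {leaves:3d} leaf switches, "
          f"{uplinks:3d} x 10 GbE uplinks, {oversub:.1f}:1 oversubscription")
```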


The Top 10 Supercomputers on the June 2006 Top500 List

This section provides additional information on the systems in the Top 10. Information is limited to that which the owners or vendors of the systems have placed on their web sites or elsewhere on the Internet. Links to some of these information sources are included in the Appendix at the end of the document.

#1 Lawrence Livermore National Laboratory (LLNL) Blue Gene/L

The highest performing supercomputer on the Top500 list is the LLNL Blue Gene/L, whose performance has held at 280.6 TeraFlops since the November 2005 list (up from 137 TeraFlops on the June 2005 list). Blue Gene/L is an MPP system with 65,536 dual-processor compute nodes and 1,024 I/O nodes. The compute nodes run a stripped down version of the Linux kernel and the I/O nodes run a complete version of the Linux operating system. The full system consists of 64 racks, with each rack housing 1,024 compute nodes and 16 I/O nodes.

Blue Gene/L (BG/L) uses three specialized networks for IPC: a 3D torus with 1.4 Gbps of bidirectional bandwidth for the bulk of message passing via MPI, a tree network for collective operations, and a synchronization barrier/interrupt network. Interfaces for all three of these networks are integrated on the node processor ASICs, as shown in Figure 8.

In addition to the IPC networks, further connectivity is provided by two separate Ethernet networks, as shown in Figure 9. Each compute node has a 100/1000 Ethernet interface dedicated to control and management, including system boot, debug, and performance/health monitoring (control information can also be transmitted via the ASIC’s JTAG interface). Each of the 1,024 I/O nodes uses Gigabit Ethernet for file access and external communications.

Therefore, the LLNL BG/L system incorporates 65,536 ports of Fast Ethernet or GbE in the control/management network and 1,024 ports of GbE in the I/O and file server network.

One of the key design guidelines for the BG/L was to optimize performance per watt of power consumed rather than maximizing performance per processor. The result is the ability to integrate 1,024 dual-processor compute nodes into a rack 0.9 m wide, 0.9 m deep, and 1.9 m high that consumes 27.5 kW of total power. For example, BG/L yields 25 times more performance per kW than the NEC Earth Simulator at #10 on the current list.
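These per-rack figures multiply out directly to the system-wide Ethernet port counts and power draw; a short check using only numbers quoted in this paper (the 5.7 TeraFlops per rack peak figure is taken from the eServer Blue Gene Solution description below):

```python
# Blue Gene/L scale arithmetic, using figures quoted in this paper.
racks             = 64
compute_per_rack  = 1024
io_per_rack       = 16
rack_power_kw     = 27.5
peak_tflops_rack  = 5.7      # per-rack peak, from the eServer Blue Gene Solution figures

compute_nodes = racks * compute_per_rack    # 65,536 FE/GbE control-network ports
io_nodes      = racks * io_per_rack         # 1,024 GbE I/O and file-server ports
total_power   = racks * rack_power_kw       # ~1,760 kW for the full system
peak_tflops   = racks * peak_tflops_rack    # ~365 TFlops peak vs. 280.6 Linpack

print(f"{compute_nodes} compute nodes, {io_nodes} I/O nodes, "
      f"{total_power:.0f} kW, {peak_tflops:.0f} TFlops peak")
```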

Because of the large number of nodes in a single rack, more than 85% of the inter-node connectivity is contained within the racks. The corresponding dramatic reduction in connectivity across racks allows for higher density, higher reliability, and a generally more manageable system.

Because the design philosophy led to a very large number of processors, the decision was made to provide the system with a very robust set of Reliability, Availability, and Serviceability (RAS) features. The BG/L design team was able to exploit the flexibility afforded by an ASIC level design to integrate a number of RAS features typically not found on commodity servers used in cluster implementations. As supercomputers continue to scale up in processor count, RAS is expected to become an increasingly critical aspect of HPC system design.

Figure 8. Block diagram of the Blue Gene/L processor ASIC

Figure 9. High level view of the Blue Gene/L system

BG/L has been designed to be applicable to a broad range of applications in the following categories:

• Simulations of physical phenomena
• Real-time data processing
• Off-line data analysis

Accordingly, IBM has made the BG/L into a standard product line which it intends to sell to both the traditional HPC market and the broader enterprise market. The Linux-based IBM eServer Blue Gene Solution is available from 1 to 64 racks with peak performance of up to 5.7 TeraFlops per rack. A one-rack entry version sells for approximately $1.5M. This price/performance point is likely to be attractive for enterprises with compute-intensive, mission critical applications that can be accelerated through parallelization.
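At that price point the entry configuration works out to roughly $260 per GFlops of peak capacity; a quick calculation using only the approximate figures above (list price and peak rating, ignoring facilities and support costs):

```python
# Rough price/performance for a one-rack eServer Blue Gene Solution entry system,
# using the approximate figures quoted above (peak GFlops, not sustained Linpack).
price_usd        = 1.5e6      # approximate one-rack entry price
peak_tflops_rack = 5.7        # peak TeraFlops per rack

usd_per_gflops = price_usd / (peak_tflops_rack * 1000)
print(f"~${usd_per_gflops:.0f} per peak GFlops")
```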

As a result of this eServer Blue Gene Solution initiative, we can expect to see an increasing number of Blue Gene systems appearing on the Top500 list for some time to come. There are 24 eServer Blue Gene Solution computers on the current list.

#2 IBM Thomas J. Watson Research Center Blue Gene/W

At #2 on the Top500 list, with performance of 91.3 TeraFlops, is another Blue Gene system installed at the IBM Thomas J. Watson Research Center (BG/W). BG/W uses the same system design as BG/L but is a 20-rack system with 20,480 compute nodes and 320 I/O nodes.

Therefore, the IBM BG/W system incorporates 20,480 ports of Fast Ethernet or GbE in the control/management network and 320 ports of GbE in the I/O and file server network.

#3 Lawrence Livermore National Laboratory ASC Purple, IBM

At #3, with upgraded performance of 75.8 TeraFlops for the June 2006 list, is the ASC Purple MPP built by IBM for LLNL. ASC Purple currently consists of 1,536 nodes, including 1,280 compute nodes and 128 I/O nodes. Purple is comprised of 131 node racks, 90 disk racks, and 48 switch racks. Each p575 Squadron IH node is an 8-way SMP server that is powered by eight Power5 microprocessors running at 1.9 GHz and is configured with 32 GB of memory.

As shown in Figure 10, the ASC Purple IPC interconnect fabric is an IBM 3-stage, dual-plane Federation switch with 1,536 dual ports. This switch array is built from 480 32-port switches and 9,216 cables. The fabric provides 8 GBps of peak bi-directional bandwidth.

Purple has 2 petabytes (2 million gigabytes) of storage furnished by SATA and FibreChannel RAID arrays with over 11,000 disks. More than 2,000 FibreChannel 2 links are required for storage access.

In addition, the system has two Squadron 64-way Power5 nodes logically partitioned into four login nodes. Each login node has eight 10 GbE ports for parallel FTP access to the archive and two GbE ports for NFS and SSH (login) traffic. System management functions are facilitated with a separate Ethernet management fabric with over 1,536 Fast Ethernet ports.
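The processor count implied by these figures matches the 10,240 processors listed for ASC Purple in Figure 4; a short check (the memory and switch-port totals below are simple multiplications of the per-node and per-switch figures above, and assume every node carries 32 GB):

```python
# ASC Purple scale arithmetic, using figures quoted above.
compute_nodes   = 1280
cpus_per_node   = 8                    # eight 1.9 GHz Power5 processors per p575 node
total_cpus      = compute_nodes * cpus_per_node          # 10,240, matching Figure 4

nodes_total     = 1536
mem_per_node_gb = 32
total_mem_tb    = nodes_total * mem_per_node_gb / 1024   # ~48 TB if every node has 32 GB

federation_switches = 480
ports_per_switch    = 32
switch_ports        = federation_switches * ports_per_switch   # 15,360 switch ports in the fabric

print(total_cpus, f"{total_mem_tb:.0f} TB", switch_ports)
```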

Figure 10. High level view of the ASC Purple system


#4 NASA Columbia SGI Altix 3700

The fourth system on the Top500 list, at 51.9 TeraFlops, is the NASA Columbia system consisting of 20 SGI Altix 3700 Superclusters, as shown in Figure 11. Each Supercluster contains 512 Itanium 2 processors with 1 Terabyte of global shared memory across the cluster. Each Supercluster runs a single image of the Linux operating system.

The primary fabric for IPC is NumaLink, a low-latency proprietary SGI interconnect with 24 Gbps of bidirectional bandwidth. Each supernode is also connected with InfiniBand and two 10 Gigabit Ethernet ports for I/O and storage system access. Therefore, this system design requires 40 ports of 10 Gigabit Ethernet switching.

#5 Commissariat à l’Énergie Atomique (CEA) Tera-10

The fifth computer on the list is the Tera-10 supercomputer, with performance of 42.9 TeraFlops, owned by the French nuclear energy agency. The Tera-10 is a Linux cluster of Bull NovaScale 602 servers consisting of 544 compute nodes and 56 I/O nodes, as shown conceptually in Figure 12. Each NovaScale 602 server node has 16 Itanium 2 processors in an SMP configuration. The SMP is based on FAME (Flexible Architecture for Multiple Environments) internal switches that are used to provide individual processors with access to I/O and shared memory. Note that the system has been incorrectly categorized as a constellation in the Top500 list.

Quadrics QsNet II provides the IPC fabric connecting compute and I/O nodes, FibreChannel on the I/O nodes is used for storage connect, and Ethernet is used for data I/O and management.

#6 Sandia National Labs Thunderbird Dell PowerEdge Cluster

At the #6 position, the Sandia Thunderbird is the second highest performing cluster on the list, with 38.3 TeraFlops of performance. Thunderbird is constructed of 4,096 compute nodes consisting of Dell PowerEdge servers. Each PowerEdge 1850 1U server has two single-core Intel 64-bit (EM64T) Xeon 3.6 GHz processors, for a total of 8,192 processors.

The IPC fabric is provided by 10 Gbps InfiniBand. A large switched Ethernet network with 4,600 GbE ports and forty 10 GbE ports serves as the management fabric, I/O fabric, and storage fabric of the cluster. The management fabric spans the compute nodes, the InfiniBand switches, and the storage nodes.

#7 Tokyo Institute of Technology TSUBAME Sun Fire Cluster

The #7 computer on the list, at 38.2 TeraFlops, is the Tokyo Tech TSUBAME, based on 655 Sun Fire x64 servers with a total of 10,480 AMD Opteron processor cores and Sun InfiniBand-attached storage. Each Sun Fire uses a Galaxy 4 8-way SMP processor configuration.

All nodes are interconnected via InfiniBand DDR (20 Gbps) for IPC communications, as well as for storage interconnect and network I/O via an InfiniBand/Ethernet gateway. TSUBAME, therefore, is based on a version of the converged server fabric being promoted by InfiniBand vendors.

The Sun Fire servers also use ClearSpeed's Advance floating-point co-processors to accelerate floating point operations. The Advance board can reportedly deliver 25 GigaFlops of number-crunching performance while consuming only 10 watts of power. The Advance co-processor is a special multi-core parallel processor implemented as a system on a chip. It uses a MultiThreaded Array Processor (MTAP) with 96 floating point cores, a high-speed network interconnecting them, and dedicated DDR2 memory.

Figure 11. NASA’s Columbia System

Figure 12. CEA’s Tera-10 System


Clusters based on multi-core SMP compute nodes and multi-core co-processors, perhaps interconnected in a grid of clusters, appear to be a fruitful direction in the pursuit of higher performance that is less constrained by either physical size or power consumption difficulties.

#8 Forschungszentrum Juelich (FZJ) JUBL BlueGene/L

At #8 on the Top500 list, with performance of 37.3 TeraFlops, is another Blue Gene system, installed at FZJ. The JUelicher BlueGene/L (JUBL) uses the same system design as BG/L but is an 8-rack system with 8,192 compute nodes and 288 I/O nodes.

Therefore, the JUBL system incorporates 8,192 ports of Fast Ethernet or GbE in the control/management network and 288 ports of GbE in the I/O and file server network.

#9 Sandia National Labs Red Storm Cray XT3

Sandia worked closely with Cray to develop Thor’s Hammer, the first MPP supercomputer to use the Red Storm architecture. Cray has now leveraged the design to create its next generation MPP product, the Cray XT3 supercomputer. Thor’s Hammer is now listed as the Red Storm Cray XT3.

Red Storm is currently comprised of 5,184 dual-core Opteron processor compute nodes housed in 108 cabinets. In addition, there are 256 service and I/O nodes housed in 16 cabinets. The compute nodes run a microkernel derived from Linux developed at Sandia, while the service and I/O nodes run a complete version of Linux. The system architecture allows the number of processors to be increased to 30,000, potentially upping performance from the current 36.2 TeraFlops.

Figure 13. High level view of the Sandia Red Storm MPP

The installation at Sandia will operate as a partitioned network configuration with a classified section (Red) and an unclassified section (Black), as shown in Figure 13. The machine can be rapidly reconfigured to switch 50% of all the compute nodes between the classified and unclassified sections. In Figure 13, the switchable compute cabinets are shown in white. In normal operation, three-quarters of the compute nodes are in either the Red or Black section.

Red Storm uses a 27 x 16 x 24 3D torus IPC fabric to interconnect its compute nodes. The peak bi-directional bandwidth of each link is 7.6 GBps, with a sustained bandwidth in excess of 4 GBps. The torus leverages the Opteron’s HyperTransport interfaces and is based on Cray’s SeaStar chip. The Cray torus interconnect carries all message passing traffic as well as the traffic between the compute nodes and the I/O nodes, as shown in Figure 13.

As with other clusters, Fast Ethernet is used for management of the compute and I/O nodes, for a total of over 2,582 ports. In addition, Red Storm will incorporate more than 80 ports of 10 GbE to connect the system to file servers and another 40 ports for external I/O to other computing resources, such as a new "Visualization Cluster" for 3D modeling.


#10 NEC Earth Simulator System (ESS)

The #10 system is the NEC Earth Simulator, in Japan. The Earth Simulator is a special purpose MPP machine, made by NEC with the same vector processing technology used in the NEC SX-6 commercial product. The decision by NEC to base the design entirely on vector processors was something of a departure from previous approaches to supercomputer design.

The Earth Simulator consists of 640 shared memory vector supercomputers that are connected by a massive high-speed interconnect network. The interconnection network (IN) consists of a 640 x 640 single-stage crossbar switch with approximately 100 Gbps of bi-directional bandwidth per port. The aggregate switching capacity of this interconnect network is over 63 Tbps. This high level of performance was achieved by splitting the switch into 128 data switch units, each consisting of a byte-wide 640 x 640 switch. The 128 data switch units are housed in 65 racks and require over 83,000 cables.

Each supercomputer node contains eight vector processors with a peak performance of 8 GFlops each and a high-speed memory of 16 GBytes. The total number of processors is 5,120 (8 x 640), which translates to a total of approximately 40 TeraFlops of peak performance and a total main memory of 10 Terabytes.
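The per-node figures above multiply out consistently to the quoted system totals; a short check:

```python
# Earth Simulator scale arithmetic, using figures quoted above.
nodes           = 640
cpus_per_node   = 8
gflops_per_cpu  = 8
mem_per_node_gb = 16
port_gbps       = 100                # approximate bi-directional bandwidth per crossbar port

total_cpus   = nodes * cpus_per_node                 # 5,120 vector processors
peak_tflops  = total_cpus * gflops_per_cpu / 1000    # ~41 TFlops peak
total_mem_tb = nodes * mem_per_node_gb / 1024        # 10 TB of main memory
xbar_tbps    = nodes * port_gbps / 1000              # ~64 Tbps aggregate crossbar capacity

print(f"{total_cpus} processors, {peak_tflops:.0f} TFlops peak, "
      f"{total_mem_tb:.0f} TB memory, {xbar_tbps:.0f} Tbps crossbar")
```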

However, the SX-6 processors consume considerable power and space. With only 16 processors per rack, 320 racks are required for the processors alone. A special building, 65 m x 50 m in area, was constructed to house the ESS, as shown in Figure 14.

The system layout for the ESS is similar to the one shown in Figure 15, which is from a large NEC SX-8 based system that adheres to the same general architecture. The compute nodes are connected by three switched networks: the 640 x 640 crossbar (IN or IXS), GbE for I/O and management, and a Fibre Channel SAN for storage access. Therefore, the ESS uses a total of 640 ports of GbE switching.

Figure 14. Special building for the Earth Simulator

Figure 15. Earth Simulator block diagram


Conclusion

Switched Ethernet technology is making an increasingly significant contribution to the advancement of supercomputing and HPC. Within the Top500, Ethernet has achieved the following milestones:

• GbE is now the leading IPC fabric (used by 51% of the supercomputers on the list)

• GbE is the leading IPC fabric for clusters (69% of clusters use GbE)

• The cost-effectiveness of GbE is helping make supercomputing accessible to more industrial enterprises (94% of all industrial clusters use GbE for the IPC fabric)

• Driven by GbE cluster technology, supercomputing is being more widely adopted by industry. With the growth in industrial clusters that began in earnest in June 2003, 88% of all industrial supercomputers are now clusters and 51% of the supercomputers on the list are now owned by industrial enterprises.

• Although they are not listed as being based on GbE interconnect, the Top 10 supercomputers in the world make extensive use of high density Fast Ethernet, GbE, and 10 GbE switching for non-IPC fabric functions: management, network I/O, and storage I/O.

The cost-effectiveness and accessibility of supercomputing based on GbE clusters have been well demonstrated within the Top500. This is encouraging more enterprises to work to identify opportunities to derive business benefit from parallel applications and HPC, even in areas such as financial analysis and database processing. As a result, GbE clusters are expected to continue to grow in significance, both as a mainstream technology of the enterprise data center and as a component of the Top500 list.

Appendix: Links for additional information

Top500 Lists and Database
http://top500.org

#1 LLNL Blue Gene/L
http://www.research.ibm.com/journal/rd/492/moreira.pdf
More on how the control Ethernet is used:
http://www.research.ibm.com/journal/rd/492/gara.pdf

General info on IBM’s Blue Gene/L:
http://www.research.ibm.com/journal/rd/492/gara.pdf
http://www.research.ibm.com/journal/rd/492/coteus.pdf

#2 BGW – IBM’s Blue Gene at Watson:
http://www.research.ibm.com/bluegene/conferences/bp2005/4.BGW_Overview.pdf
And something on the apps:
http://www.research.ibm.com/bluegene/conferences/bp2005/7.BGW_Mission_Utilization.pdf

#3 LLNL ASC Purple
http://www.llnl.gov/asci/platforms/purple/details.html#fig9

#4 NASA Columbia
http://www.sgi.com/features/2004/oct/columbia/

#5 Commissariat à l’Énergie Atomique (CEA) Tera-10
http://www.cea.fr/fr/presse/dossiers/Tera10_janvier2006.pdf

#6 Sandia Thunderbird

#7 Tokyo Institute of Technology TSUBAME Sun Fire Cluster
http://www.sun.com/smi/Press/sunflash/2006-06/sunflash.20060628.1.xml

#8 Forschungszentrum Juelich (FZJ) JUBL BlueGene/L
http://www.fz-juelich.de/zam/ibm-bgl

#9 Sandia Red Storm
http://www.cs.sandia.gov/platforms/RedStorm.html
http://www.cs.sandia.gov/platforms/RedStorm_072704NewsRelease.html
http://www.hotchips.org/archives/hc15/2_Mon/1.cray.pdf
http://www.foundrynet.com/about/newsevents/releases/pr10_21_03.html
http://www.csm.ornl.gov/workshops/SOS8/Camp-SOS8.ppt#6

#10 Earth Simulator
http://www.es.jamstec.go.jp/esc/eng/ES/hardware.html

© 2006 Force10 Networks, Inc. All rights reserved. Force10 Networks and the Force10 logo are registered trademarks, and EtherScale, FTOS, SFTOS, and TeraScale are trademarks of Force10 Networks, Inc. All other brand and product names are trademarks or registered trademarks of their respective holders. Information in this document is subject to change without notice. Certain features may not yet be generally available. Force10 Networks, Inc. assumes no responsibility for any errors that may appear in this document.

WP12 906 v2.0

Force10 Networks, Inc.
350 Holger Way
San Jose, CA 95134 USA
www.force10networks.com

408-571-3500 PHONE

408-571-3550 FACSIMILE
