William Kerney


4/29/00

Clusters: Separating Myth from Fiction

I. Introduction

Over the last few years, clusters of commodity PCs have become ever more prevalent. Since the early '90s, computer scientists have been predicting the demise of Big Iron, that is, the custom-built supercomputers of the past such as the Cray X-MP or the CM-*, due to workstations' superior price/performance. Big Iron machines were able to stay viable for a long time because they could perform computations that were infeasible on even the fastest of personal computers. In the last few years, though, clusters of personal computers have nominally caught up to supercomputers in raw CPU power and interconnect speed, putting three self-made clusters in the top 500 list of supercomputers [1].

This has led to a lot of excitement in the field of clustered computing, and to inflated expectations as to what clusters can achieve. In this paper, we will survey three clustering systems, compare some common clusters with a modern supercomputer, and then discuss some of the myths that have sprung up about clusters in recent years.

II. The NOW Project

One of the most famous research efforts in clustered computing was the NOW project at UC Berkeley, which ran from 1994 to 1998. "A Case for NOW" by Culler et al. is an authoritative statement of why clusters are a good idea: they have lower costs, greater performance, and can even double as a general computer lab for students.

The NOW cluster physically looked like any other undergraduate computer lab: in 1998 it had 64 UltraSPARC I boxes with 64MB of main memory each, all of which could be logged into individually. For all intents and purposes they looked like individual workstations that could submit jobs to an abstract global pool of computational cycles. This global pool is provided by GLUnix, a distributed operating system that sits atop Solaris and provides load balancing, input redirection, job control and co-scheduling of applications that need to run at the same time. GLUnix load balances by watching the CPU usage of all the nodes in the cluster; if a user sits down at one workstation and starts performing heavy computations, the OS will notice and migrate any global jobs to a less loaded node. In other words, GLUnix is transparent: it appears to a user that he has full access to his workstation's CPU at all times, with a batch submission system to access spare cycles on all the machines across the lab. The user does not decide which nodes to run on; he simply uses the resources of the whole lab.
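As a rough illustration of the placement decision described above, here is a minimal sketch of least-loaded job placement. The node names, load numbers and stand-alone main() are invented for illustration; the real GLUnix gathers load information from daemons on each workstation and can also migrate running jobs when an interactive user appears.

/* Minimal sketch: given the current load on each node, pick the
 * least-loaded one for a new global job.  Data is made up. */
#include <stdio.h>

struct node { const char *name; double load; };

static int pick_least_loaded(const struct node *nodes, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (nodes[i].load < nodes[best].load)
            best = i;
    return best;
}

int main(void)
{
    struct node lab[] = { {"u0", 1.70}, {"u1", 0.15}, {"u2", 0.90} };
    int target = pick_least_loaded(lab, 3);
    printf("launch global job on %s (load %.2f)\n",
           lab[target].name, lab[target].load);
    return 0;
}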

David Culler and the other developers of NOW also produced one of the most important ideas to come out of clustered computing: Active Messages. Active Messages were devised to compensate for the slower networks that workstations typically use, such as 10BaseT or 100BaseT Ethernet, which get nowhere near the performance of high-performance custom hardware like hypercubes or CrayLinks. In the NOW cluster, when an active message packet arrives, the NIC writes the data directly into an application's memory. Since the application no longer has to poll the NIC or copy data out of the NIC's buffer, the overall end-to-end latency decreases by 50% for medium-sized (~1KB) messages and from 4ms to 12us for short (one-packet) messages, a reduction of more than two orders of magnitude. A network with Active Messages running through it has a lower half-power point (the message size that achieves half the maximum bandwidth) than a network using TCP, since Active Messages have much lower latency, especially for short messages. A network with AM hits the half-power point at 176 bytes, compared with 1352 bytes for TCP. Given that 95% of packets are less than 192 bytes and the mean size is 382 bytes (according to a study performed on the Berkeley network), Active Messages are far superior to TCP.
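The following toy program sketches that handler-based style. The am_send() call and handler name are hypothetical stand-ins, not the Berkeley AM API, and delivery is simulated in-process so the sketch stays self-contained.

/* Toy illustration of the Active Messages style: the sender names a
 * handler, and on "arrival" the payload is handed straight to that
 * handler, which deposits it into application memory -- no select()
 * polling and no extra copy out of a socket buffer. */
#include <stdio.h>
#include <string.h>

typedef void (*am_handler_t)(void *payload, size_t len);

static char app_buffer[1024];   /* application-owned destination memory */

static void deposit_handler(void *payload, size_t len)
{
    if (len > sizeof app_buffer)
        len = sizeof app_buffer;
    memcpy(app_buffer, payload, len);   /* data lands in app memory */
}

/* Hypothetical send: a real implementation would hand the handler id and
 * payload to the NIC; here the handler is invoked as the receiver would. */
static void am_send(am_handler_t handler, void *payload, size_t len)
{
    handler(payload, len);
}

int main(void)
{
    char msg[] = "one chunk of a result vector";
    am_send(deposit_handler, msg, sizeof msg);
    printf("receiver side sees: %s\n", app_buffer);
    return 0;
}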

The downside to Active Messages is that programs must be rewritten to take advantage of the interface; by default, programs poll the network with a select(3C) call and do not set up regions of memory for the NIC to write into. It is not a straightforward conversion from TCP sockets, since the application has to set up handlers that get called back when a message arrives for the process. The NOW group worked around this by implementing Fast Sockets, which presents the same API as UNIX sockets but has an Active Messages implementation beneath it.

The results that came out of the NOW project were quite promising. It broke the world record for the Datamation disk-to-disk sorting benchmark in 1997, demonstrating that a large number of cheap workstation drives can have a higher aggregate bandwidth than a smaller number of high-performance drives in a centralized server. The NOW project also showed that, for a fairly broad class of problems, the cluster was scalable and could challenge the performance of traditional supercomputers with inexpensive components. Its Active Messages system, by lowering message latency, mitigated the slowdown caused by running on a cheap interconnect.

III. HPVM

HPVM, or High-Performance Virtual Machine, is a project by Andrew Chien et al. at the University of Illinois at Urbana-Champaign (1997-present) that built in part on the successes of the NOW project. Their goal was similar to PVM's, in that they wanted to present an abstract layer that looked like a generic supercomputer to its users but was actually composed of heterogeneous machines beneath.

The important difference between HPVM and both PVM and NOW is that where PVM and NOW use their own custom APIs to access the parallel processing capabilities of their systems, requiring programmers to spend a moderate amount of effort porting their code, HPVM presents four different APIs which mimic common supercomputing interfaces. So, for example, if a programmer already has a program written using SHMEM (the one-sided memory transfer API used by Crays), then he will be able to quickly port it to HPVM. The interfaces implemented by HPVM are MPI, SHMEM, Global Arrays (similar to shared memory, but allowing multi-dimensional arrays) and FM (described below).
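For instance, an unmodified standard MPI program like the minimal ping below is the kind of code HPVM aims to run on a cluster by layering MPI over Fast Messages. The example is generic MPI-1, not code from the HPVM documentation; compile it with an MPI wrapper compiler such as mpicc and run it on two processes.

/* Minimal MPI-1 ping between two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}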

The layer beneath HPVM's multiple APIs is a messaging layer called Fast Messages. FM was developed in 1995 as an extension of Active Messages. Since then, AM has continued to be developed as well, so the projects have diverged slightly over the years, though both have independently implemented new features such as allowing more than one active process per node. The improvements FM made over AM include the following:

1) FM allows the user to send messages larger than will fit in main memory; AM does not.

2) AM returns an automatic reply to every request sent in order to detect packet loss. FM implements a more sophisticated reliable-delivery protocol and guarantees in-order delivery of messages.

3) AM requires the user to specify the remote memory address the message will be written into; FM only requires that a handler be specified for the message.

In keeping with HPVM's goal of providing an abstract supercomputer, it theoretically allows its interface to sit above any combination of hardware that a system administrator can throw together. In other words, it would allow an administrator to combine 10 Linux boxes, 20 NT workstations and a Cray T3D into a virtual supercomputer that could run MPI, FM or SHMEM programs quickly (via the FM layer underlying it all).

In reality, Chien's group only implemented the first version of HPVM on NT and Linux boxes, and their latest version only supports NT clustering. A future release might add support for more platforms.

IV. Beowulf

Beowulf has been the big name in clustering recently. Every member of the high-tech press has run at least one story on Beowulf: Slashdot, ZDNet, Wired, CNN and others. One of the more interesting things to note about Beowulf clusters is that there is no such thing as a definitive Beowulf cluster. Various managers have labeled their projects "Beowulf-style" (like the Stone SouperComputer), while others will say that a true Beowulf cluster is one that mimics the original cluster at NASA. Still others claim that any group of boxes running an open source operating system is a Beowulf. The definition we will use here is: any cluster of workstations running Linux with the packages available from the official Beowulf website.

The various packages include:

1) Ethernet bonding: this allows multiple Ethernet channels to be logically joined into one higher-bandwidth connection. In other words, if a machine had two 100Mb/s connections to a hub, it would be able to transmit data over the network at 200Mb/s, assuming that all other factors are negligible.

2) PVM or MPI: these standard toolkits are what allow HPC programs to actually be run on the cluster. Unless the user has an application whose granularity is so coarse that it can be run merely with remote shells, he will want either PVM, MPI or an equivalent installed.

3) Global PID space: this patch ensures that any given process ID is in use on only one of the Linux boxes in the cluster. Thus, two nodes can always agree on what process 15 is; this helps promote the illusion of the cluster being one large machine instead of a number of smaller ones. As a side effect, the global PID space patch allows processes to be started on remote machines.

4) Virtual shared memory: this also contributes to the illusion of the Beowulf cluster being one large machine. Even though the hardware in each machine has no concept of a remote memory address, as an Origin 2000 does, with this kernel patch a process can use pages of memory that physically exist on a remote machine. When a process tries to access memory that is not in local RAM, it triggers a page fault, which invokes a handler that fetches the page from the remote machine (see the sketch after this list).

5) Modified standard utilities: utilities like ps and top have been altered to give process information for all the nodes in the cluster instead of just the local machine. This can be thought of as a transparent way of handling things typically dealt with by a supercomputer's batch queue system. Where a user on the Origin 2000 would run bps to examine the state of the processes in the queues, a Beowulf user would simply run ps and look at the state of both local and remote jobs at the same time. It is up to the user's taste which way is preferable.
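To make the page-fault mechanism in item 4 concrete, here is a minimal user-level sketch of fault-driven paging. It is not the Beowulf kernel patch: the fault is caught with a SIGSEGV handler rather than inside the kernel, the "remote fetch" just fills the page locally, and error handling is mostly omitted.

/* User-level sketch of page-fault-driven distributed shared memory. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *region;   /* the "shared" region, initially inaccessible */
static size_t page;     /* system page size */

/* Stand-in for asking the owning node for the page over the network. */
static void fetch_remote_page(void *page_start)
{
    memset(page_start, 'A', page);
}

static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *page_start =
        (char *)((uintptr_t)info->si_addr & ~(uintptr_t)(page - 1));

    /* Make the page accessible, then pull in its "remote" contents. */
    mprotect(page_start, page, PROT_READ | PROT_WRITE);
    fetch_remote_page(page_start);
}

int main(void)
{
    page = (size_t)sysconf(_SC_PAGESIZE);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* Reserve four pages with no access rights: any touch page-faults. */
    region = mmap(NULL, 4 * page, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return 1;

    region[0] = 'B';   /* the write faults; the handler "fetches" the page */
    printf("page resident now, region[1] = %c\n", region[1]);
    return 0;
}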

A good case study of Beowulf is the Avalon project at Los Alamos National Laboratory. They put together a 70-CPU Alpha cluster for $150,000 in 1998. In terms of peak performance, it scored twice as high as a multi-million dollar Cray with 256 nodes. Peak rate, though, is a misleading performance metric: people will point to the high GFLOPS number and ignore the fact that those benchmarks did not take communication into account. This leads to claims like the ones the authors make, that do-it-yourself supercomputing will make vendor-supplied supercomputers obsolete because their price/performance ratio is so poor.

Interestingly enough, in the two years since that paper was published, the top 500 list of supercomputers is still overwhelmingly dominated by vendors. In fact, there are only three self-made systems on the list, with the Avalon cluster (number 265) being one of them.

Why is that the case?

Although it gets a great peak performance, three times greater than the Origin 2000, a Beowulf cluster like Avalon doesn't work as well in the real world. Real applications communicate heavily, and a fast Ethernet switch cannot match the speed of the custom Origin interconnect. Even though Avalon used the same number of 533MHz 21164 Alphas as the Origin 2000 had 195MHz R10000s, the NAS Parallel Benchmarks (Class B) rated the O2K at twice the performance. A 533MHz 21164 has a SPECint95 rating of 27.9, while the 195MHz R10000 only gets 10.4. This means that, thanks to its custom hardware, the O2K was able to extract roughly six times as much delivered performance per unit of raw processor speed. Although the authors claim a win because their system was 20 times cheaper than the Origin, the figures can be read the opposite way: they justify the cost of an Origin by saying, in effect, "If you want to make your system run six times faster, you can pay extra for some custom hardware." And given the moderate success of the Origin 2000 line, users seem to agree with this philosophy.
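A back-of-the-envelope check of that "six times" figure, treating SPECint95 as a rough proxy for per-processor capability and normalizing Avalon's delivered NPB Class B performance to 1 (the normalization is an assumption made here, not a number from the Avalon paper):

    (2 / 10.4) / (1 / 27.9) = (2 x 27.9) / 10.4 ≈ 5.4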

One important thing to note about Beowulf clusters is that they are different from a NOW: instead of being a computer lab where students can sit down and use any of the workstations individually, a Beowulf is a dedicated supercomputer with one point of entry. (This is actually something the GRID book gets wrong; pages 440-441 say that NOWs are dedicated, but the cited NOW papers repeatedly state that the system can migrate jobs away from workstations being used interactively.)

Both NOWs and Beowulfs are made of machines with independent local memory spaces, but they go about presenting a global machine in different ways. A Beowulf uses kernel patches to pretend to be a multi-CPU machine with a single address space, whereas the NOW project uses GLUnix, a layer that sits above the system kernel and loosely glues machines together by allowing MPI jobs to be scheduled and moved between nodes.

V. Myth

As the Avalon paper demonstrated, there are a lot of inflated expectations of what clusters can accomplish. Scanning through the forums of Slashdot, one can easily see that a negative attitude toward vendor-supplied supercomputers prevails. Quotes like "Everything can be done with a Beowulf cluster!" and "Supercomputers are dead" are quite common. This reflects a naivety on the part of the technical public as a whole. There are three refutations to beliefs such as these:

1) The difference between buying a supercomputer and building a cluster is the difference between repairing a broken window yourself and having a professional do it for you. Building a Beowulf cluster is do-it-yourself supercomputing. It is a lot cheaper than paying professionals like IBM or Cray to do it for you, but as a trade-off, the system will be less reliable because it was built by amateurs. The Avalon paper tried to refute this by noting that they had over 100 days of uptime, but reading the paper carefully, one can see that only 80% of their jobs completed successfully. Why did 20% fail? They didn't know.

Holly Dail mentioned that the people who built the Legion cluster at the University of Virginia had problems from insufficient air conditioning in their machine room. A significant fraction of the cost of a supercomputer goes into the chassis, and the chassis is designed to properly ventilate multiple CPUs running heavy loads. Sure, the Virginia group got a supercomputer for less than a real one costs, but they paid for it in hardware problems.

2) Businesses need high availability. 40% of IT managers interviewed by ZDNet [13] said that the reason they were staying with mainframes and not moving to clusters of PCs is that large, expensive computers come with more stringent uptime guarantees. IBM, for example, makes a system with a guaranteed 99.999% uptime, which means that the system will only be down for about five minutes during an entire year. Businesses can't afford to rely on systems like ASCI Blue, which is basically 256 quad Pentium Pro boxes glued together with a custom interconnect; ASCI Blue has never been successfully rebooted.
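For reference, the downtime that a 99.999% ("five nines") guarantee allows works out to roughly five minutes per year:

    (1 - 0.99999) x 365 x 24 x 60 ≈ 5.3 minutes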

A large part of the cost of vendor-supplied machines pays for testing. As a researcher, you might not care if you have to restart your simulation a few times, but a manager in charge of a mission-critical project definitely wants to know that his system has been verified to work. Do-it-yourself projects just can't provide this kind of guarantee. That's why, whenever a business needs repairs done on its building, it hires a contractor instead of having its employees do the work for less.

3) Vendors are already doing it. It is a truism right now that Commercial Off-The-Shelf (COTS) technology should be used whenever possible. People use this to justify not buying custom-built supercomputers. The real irony is that the companies that build these supercomputers are not dumb, and they do use COTS technology whenever they can, with the notable exception of Tera/Cray, who believe in speed at any price. The only time most vendors build custom hardware is when they feel the performance gain will justify the added cost.

For example, Blue Horizon, the world's third most powerful computer, is built using components from IBM workstations: its CPUs, memory and operating system are all recycled from IBM's lower-end systems. The only significant parts that are custom are the high-performance file system (which holds 4TB and can write data in parallel very quickly), the chassis (which promotes reliability, as discussed above), the SP switch (which is kept for backwards compatibility), the monitoring software (the likes of which cannot be found on Beowulf clusters) and the memory crossbar, which replaces the bus-based memory system found on most machines these days. Replacing the bus with a crossbar greatly increases memory bandwidth and eliminates a bottleneck found in many SMP programs: when multiple CPUs try to hit memory at once, only one at a time can be served, causing severe slowdown. Blue Horizon was sold to the Supercomputer Center for $20,000,000, which works out to roughly $20,000 a processor, an outrageously expensive price. But the fact that the center was willing to pay for it is testimony enough that the custom hardware gave it a real advantage over systems built entirely from COTS products.

VI. Conclusion

Clustered computing is a very active field these days, with a number of good advances coming out of it, such as Active Messages, Fast Messages, NOW, HPVM and Beowulf. By building systems from powerful commodity processors, connecting them with high-speed commodity networks using Active Messages, and linking everything together with a free operating system like Linux, one can create a machine that looks, acts and feels like a supercomputer, except for the price tag. However, alongside the reduced price come a greater risk of failure, a lack of technical support when things break (NCSA has a full service contract with SGI, for example), and the possibility that COTS products won't do as well as custom-built ones.

A few people have drawn a distinction between two different kinds of Beowulf clusters. The first, the Type I Beowulf, is built entirely with parts found at any computer store: standard Intel processors, 100BaseT Ethernet and PC100 RAM. These machines are the easiest and cheapest to build, but they are also the slowest due to the inefficiencies common in standard hardware. The so-called Type II Beowulf is an upgrade to the Type I: it adds more RAM than is commonly found in PCs, replaces the 100BaseT with more exotic networking like Myrinet, and upgrades the OS to use Active Messages. In other words, it replaces some of the COTS components with custom ones to achieve greater speed.

I hold the view that traditional supercomputers are the logical extension of this process, a Type III Beowulf, if you will. Blue Horizon, for example, can be thought of as 256 IBM RS/6000 workstations that have been upgraded with a custom chassis and a memory crossbar in place of a bus. Just like Type II Beowulfs, they replace some of the COTS components with custom ones to achieve greater speed. There's no reason to call for the death of supercomputers at the hands of clusters; in some sense, the vendors have done that already.

References

[1] http://www.netlib.org/benchmark/top500/top500.list.html

[2] http://now.cs.berkeley.edu/Case/case.html

[3] http://www.sgi.com/origin/images/hypercube.pdf

[4] file://ftp.cs.berkeley.edu:/ucb/CASTLE/Active_Messages/hotipaper.ps

[5] http://www.usenix.org/publications/library/proceedings/ana97/full_papers/rodrigues/rodrigues.ps

[6] http://now.cs.berkeley.edu/NowSort/nowSort.ps

[7] http://www.cs.berkeley.edu/~rmartin/logp.ps

[8] http://www-csag.ucsd.edu/papers/hpvm-siam97.ps

[9] http://www-csag.ucsd.edu/projects/hpvm/doc/hpvmdoc_7.html#SEC7

[10] http://www-csag.ucsd.edu/papers/myrinet-fm-sc95.ps

[11] http://www-csag.ucsd.edu/papers/fm-pdt.ps

[12] http://slashdot.org/articles/older/00000817.shtml

[13] http://www.zdnet.com/zdnn/stories/news/0,4586,2341316,00.html

[14] http://www.wired.com/news/technology/0,1282,14450,00.html

[15] http://www.cnn.com/2000/TECH/computing/04/13/cheap.super.idg/index.html

[16] http://stonesoup.esd.ornl.gov/

[17] http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html

[18] http://cnls.lanl.gov/avalon/

[19] http://www.spec.org/osg/cpu95/results/res98q3/cpu95-980914-03070.html

[20] http://www.spec.org/osg/cpu95/results/res98q1/cpu95-980206-02411.html

[21] http://www.slashdot.org/search.pl, search for "Beowulf"