
The PlayStation 3 for High Performance Scientific Computing

Jakub Kurzak, Department of Computer Science, University of Tennessee

Alfredo Buttari, French National Institute for Research in Computer Science and Control (INRIA)

Piotr Luszczek, MathWorks, Inc.

Jack Dongarra, Department of Electrical Engineering and Computer Science, University of Tennessee; Computer Science and Mathematics Division, Oak Ridge National Laboratory; School of Mathematics & School of Computer Science, University of Manchester

The heart of the Sony PlayStation 3, the STI CELL processor, was not originally intended for scientific number crunching, and the PlayStation 3 itself was not meant primarily to serve such purposes. Yet, both these items may impact the High Performance Computing world. This introductory article takes a closer look at the cause of this potential disturbance.

The Multi-Core Revolution

In 1965 Gordon E. Moore, a co-founder of Intel, made the empirical observation that the number of transistors on an integrated circuit for minimum component cost doubles every 24 months [1, 2], a statement known as Moore's Law. Since its conception, Moore's Law has been subject to predictions of its demise. Today, 42 years later, a Google search for the phrase "Moore's Law demise" returns an equal number of links to articles confirming and denying the demise of Moore's Law. On the other hand, Rock's Law, named for Arthur Rock, states that the cost of a semiconductor chip fabrication plant doubles every four years, which questions the purpose of seeking performance increases purely in technological advances.

Aside from the laws of exponential growth, a fundamental result in Very Large Scale Integration (VLSI) complexity theory states that, for certain transitive computations (in which any output may depend on any input), the time required to perform the computation is reciprocal to the square root of the chip area. To decrease the time, the cross section must be increased by the same factor, and hence the total area must be increased by the square of that factor [3]. What this means is that instead of building a chip of area A, four chips can be built of area A/4, each of them delivering half the performance, resulting in a doubling of the overall performance. Not to be taken literally, this rule nevertheless means that, nec Hercules contra plures, many smaller chips have more power than a big one, and multiprocessing is the answer no matter how many transistors can be crammed into a given area.

On the other hand, the famous quote attributed to Seymour Cray states: "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?", which basically points out that solving problems in parallel on multiprocessors is non-trivial; a statement not to be argued. Parallel programming of small-size, shared memory systems can be done using a handful of POSIX functions (POSIX threads), yet it can also be a headache, with nothing to save the programmer from the traps of deadlocks [4]. The predominant model for programming large systems these days is message-passing, and the predominant Application Programming Interface (API) is the Message Passing Interface (MPI) [5], a library of over 100 functions in the initial MPI-1 standard [6] and 300 functions in the current MPI-2 standard [7]. The real difficulty, of course, is not a given programming model or an API, but the algorithmic difficulty of splitting the job among many workers. Here Amdahl's Law, formulated by the famous computer architect Gene M. Amdahl, comes into play [8]: it basically concludes that you can only parallelize so much, and the inherently sequential part of the algorithm will prevent you from getting speedup beyond a certain number of processing elements. (Although it is worth mentioning that Amdahl's Law has its counterpart in Gustafson's Law, which states that any sufficiently large problem can be efficiently parallelized [9].)

Nevertheless, for both technical and economic reasons, the industry is aggressively moving towards designs with multiple processing units in a single chip. The CELL processor, primarily intended for the Sony PlayStation 3, was a bold step in this direction, with as many as nine processing units in one chip that is relatively small by today's standards (234 million transistors, versus 800 million for POWER6 and 1,700 million for the dual-core Itanium 2). Unlike the competition, however, the CELL design did not simply replicate existing CPUs in one piece of silicon. Although the CELL processor was not designed completely from scratch, it introduced many revolutionary ideas.

Keep It Simple - Execution

Throughout the history of computers, processor design has been driven mainly by the desire to speed up execution of a single thread of control. This goal has been pursued by technological means, by increasing the clock frequency, and by architectural means, by exploiting Instruction Level Parallelism (ILP). The main mechanism for achieving both goals is pipelining, which allows for execution of multiple instructions in parallel, and also allows for driving the clock speed up by chopping the execution into a bigger number of smaller stages, which propagate the information faster.

The main obstacles to performance of pipelined execution are data hazards, which prevent simultaneous execution of instructions with dependencies between their arguments, and control hazards, which result from branches and other instructions changing the Program Counter (PC). Elaborate mechanisms of out-of-order execution, speculation and branch prediction have been developed over the years to address these problems. Unfortunately, these mechanisms consume a lot of silicon real estate, and a point has been reached at which they deliver diminishing returns in performance. It seems that short pipelines with in-order execution, along with short vector Single Instruction Multiple Data (SIMD) capabilities, are the most economic solution.

Shallow Pipelines and SIMD

... shallower pipelines with in-order execution have proven to be the most area and energy efficient. Given these physical and microarchitectural considerations, we believe the efficient building blocks of future architectures are likely to be simple, modestly pipelined (5-9 stages) processors, floating point units, vector, and SIMD processing elements. Note that these constraints fly in the face of the conventional wisdom of simplifying parallel programming by using the largest processors available.

Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley", http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf

Superscalar processors, with multiple functional units, can potentially schedule multiple independent instructions at the same time. It is up to the hardware, however, to figure out which instructions are independent, and the appropriate logic costs precious transistors and corresponding space. Also, there are limits to what can be accomplished in hardware, and for the technology to perform well, appropriate cooperation is required from the compiler side as well as the programmer's side. Short vector SIMD processing provides the capability of executing the same operation in parallel on many data elements, and relies on the programmer and/or the compiler to extract the parallelism from the application. By the same token, it shifts the emphasis from extracting ILP in the hardware to extracting data parallelism in the code, which simplifies the hardware and puts more explicit control in the hands of the programmer. Obviously, not all problems SIMD'ize well, but most can benefit from it in one way or another.

SIMD Extensions

Most CPUs today are augmented with some kind of short vector SIMD extensions. Examples include:

• x86 Streaming SIMD Extensions (SSE),

• PowerPC Velocity Engine / AltiVec / VMX,

• Sparc Visual Instruction Set (VIS),

• Alpha Motion Video Instructions (MVI),

• PA-RISC Multimedia Acceleration eXtensions (MAX),

• MIPS-3D Application Specific Extensions (ASP) and Digital Media Extensions (MDMX),

• ARM NEON.

Keep It Simple - Memory

When incurring a cache miss, a modern processor can stall for as many as 1000 cycles. Sometimes the request has to ripple through as many as three levels of caches to finally make it to the utterly slow main memory. When that task completes, there is little guarantee that the next reference does not miss again. Also, it is quite a remarkable fact that, while the gap between the cycle time and the memory access time kept growing in the past two decades, the size of the register file basically did not change from the original 32, since the conception of Reduced Instruction Set Computer (RISC) processors in the late '70s.

Since the cache mechanisms are automatic, programmers have no explicit control over their functioning and, as a result, frequently reverse-engineer the memory hierarchy in order to extract the performance. One way or another, it is a guessing game which gives no guarantees, and there is little that can be done when a miss happens other than waiting. Prefetching queues, which fill the caches ahead of time, somewhat improve the situation. Nevertheless, modern processors deliver memory performance nowhere close to their theoretical memory bandwidth, even for codes with an ideal memory access pattern.

The Bucket Brigade

When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available.

H. Peter Hofstee, "Cell Broadband Engine Architecture from 20,000 feet", http://www-128.ibm.com/developerworks/power/library/pa-cbea.html

The situation only gets worse when these limited resources become shared by multiple processing units put together in a single chip. Now, the limited bandwidth to the main memory becomes a scarce resource, and in addition processors evict data from each other's caches, not to mention phenomena like false sharing, where mutual evictions result from accesses to disjoint memory locations. At this point, reverse-engineering the memory hierarchy becomes tricky wizardry. The concept of a virtual memory system adds to the complexity of the memory circuitry and introduces the concept of a Translation Look-Aside Buffer (TLB) miss, which is in general much more catastrophic than a cache miss, since it transfers control to the operating system.

If only the memory could have a flat address space with a flat and fast access rate and be under explicit control of the program...

The CELL in a Nutshell

The main control unit of the CELL processor is the POWER Processing Element (PPE). The PPE is a 64-bit, 2-way Simultaneous Multi-Threading (SMT) processor binary compliant with the PowerPC 970 architecture. The PPE consists of the POWER Processing Unit (PPU), 32 KB of L1 cache and 512 KB of L2 cache. Although the PPU uses the PowerPC 970 instruction set, it has a relatively simple architecture with in-order execution, which results in considerably less circuitry than its out-of-order execution counterparts and lower energy consumption. The high clock rate, high memory bandwidth and dual threading capabilities should make up for the potential performance deficiencies stemming from its in-order execution architecture. The SMT feature, which comes at a small 5 percent increase in the cost of the hardware, can potentially deliver a 10 to 30 percent increase in performance [10]. The PPU also includes a short vector SIMD engine, VMX, an incarnation of the PowerPC Velocity Engine or AltiVec.

The real power of the CELL processor lies in the eight Synergistic Processing Elements (SPEs) accompanying the PPE. Each SPE consists of the Synergistic Processing Unit (SPU), 256 KB of private memory referred to as the Local Store (LS), and the Memory Flow Controller (MFC), delivering powerful Direct Memory Access (DMA) capabilities to the SPU. The SPEs are the short vector SIMD workhorses of the CELL. They possess a large 128-entry, 128-bit vector register file and a range of SIMD instructions that can operate simultaneously on two double precision values, four single precision values, eight 16-bit integers or sixteen 8-bit characters. Most instructions are pipelined and can complete one vector operation in each clock cycle. This includes fused multiplication-addition in single precision, which means that two floating point operations can be accomplished on four values in each clock cycle. This translates to a peak of 2 × 4 × 3.2 GHz = 25.6 Gflop/s for each SPE and adds up to a staggering peak of 8 × 25.6 Gflop/s = 204.8 Gflop/s for the entire chip.

All components of the CELL processor, including the PPE, the SPEs, the main memory and the I/O system, are interconnected with the Element Interconnection Bus (EIB). The EIB is built of four unidirectional rings, two in each direction, and a token-based arbitration mechanism playing the role of a traffic light. Each participant is hooked up to the bus with a bandwidth of 25.6 GB/s, and the bus has an internal bandwidth of 204.8 GB/s, which means that, for all practical purposes, one should not be able to saturate the bus.

The CELL chip draws its power from the fact that it is a parallel machine including eight small, yet fast, specialized, number crunching processing elements. The SPEs, in turn, rely on a simple design with short pipelines, a huge register file and a powerful SIMD instruction set.

Also, the CELL is a distributed memory system on a chip, where each SPE possesses its private memory stripped of any indirection mechanisms, which makes it fast. This puts explicit control over data motion in the hands of the programmer, who has to employ techniques closely resembling message-passing, a model that may be perceived as challenging, but the only one known to be scalable today.

Actually, the communication mechanisms on the CELL are much more powerful than those of the MPI standard. Owing to the global addressing scheme, true one-sided communication can be utilized. At the same time, DMA transactions are by nature non-blocking, which facilitates real communication and computation overlapping, often an illusion in MPI environments.

CELL Vocabulary

• PPE - Power Processing Element

– PPU - Power Processing Unit

• SPE - Synergistic Processing Element

– SPU - Synergistic Processing Unit

– LS - Local Store

– MFC - Memory Flow Controller

• EIB - Element Interconnection Bus

• XDR - eXtreme Data Rate (RAM)

For those reasons the CELL is sometimes referred to as the supercomputer of the 80's. Not surprisingly, in some situations, in order to achieve the highest performance, one has to reach for the forgotten techniques of out-of-core programming, where the Local Store is the RAM and the main memory is the disk.

The difficulty of programming the CELL is often argued. Yet, the CELL gives a simple recipe for performance: parallelize, vectorize, overlap. Parallelize the code among the SPEs, vectorize the SPE code, and overlap communication with computation. Yes, those steps can be difficult. But if you follow them, performance is almost guaranteed.

The PlayStation 3

The PlayStation 3 is probably the cheapest CELL-based system on the market. It contains the CELL processor with the number of SPEs reduced to six, 256 MB of main memory, an NVIDIA graphics card with 256 MB of its own memory, and a Gigabit Ethernet (GigE) network card.

Sony made convenient provisions for installing Linux on the PlayStation 3 in a dual-boot setup. Installation instructions are plentiful on the Web [11]. The Linux kernel is separated from the hardware by a virtualization layer, the hypervisor. Devices and other system resources are virtualized, and Linux device drivers work with virtualized devices.

The CELL processor in the PlayStation 3 is identical to the one you can find in high-end IBM or Mercury blade servers, with the exception that two SPEs are not available. One SPE is disabled for chip yield reasons: a CELL with one defective SPE passes as a good chip for the PlayStation 3, and if all SPEs are non-defective, a good one is disabled. Another SPE is hijacked by the hypervisor.

The PlayStation 3 possesses a powerful NVIDIA Graphics Processing Unit (GPU), which could potentially be used for numerical computing along with the CELL processor. Using GPUs for such purposes is rapidly gaining popularity. Unfortunately, the hypervisor does not provide access to the PlayStation's GPU at this time.

The GigE card is accessible to the Linux kernel through the hypervisor, which makes it possible to turn the PlayStation 3 into a networked workstation, and facilitates building PlayStation 3 clusters using network switches and programming such installations using message-passing with MPI. The network card has a DMA unit, which can be set up using dedicated hypervisor calls that make it possible to perform transfers without the main processor's intervention. Unfortunately, the fact that the communication traverses the hypervisor layer, in addition to the TCP/IP stack, contributes to a very high message latency.

Programming

All Linux distributions for the PlayStation 3 come with the standard GNU compiler suite including C (GCC), C++ (G++) and Fortran 95 (GFORTRAN), which now also provides support for OpenMP [12] through the GNU GOMP library. This feature can be used to take advantage of the SMT capabilities of the PPE. IBM's Software Development Kit (SDK) for the CELL delivers a similar set of GNU tools, along with an IBM compiler suite including C/C++ (XLC), and more recently also Fortran (XLF) with support for Fortran 95 and partial support for Fortran 2003. The SDK is available for installation on a CELL-based system, as well as on an x86-based system, where code can be compiled and built in cross-compilation mode, a method often preferred by the experts. These tools practically guarantee compilation of any existing C, C++ or Fortran code on the CELL processor, which makes the initial port of any existing software basically effortless.

At the moment, there is no compiler available, either freely or as a commercial product, that will automatically parallelize existing C, C++ or Fortran code for execution on the SPEs. Although such projects exist, they are at the research stage. The code for the SPEs has to be written separately and compiled with SPE-specific compilers. The CELL SDK contains a suite of SPE compilers including GCC and XLC. The programmer has to take care of SIMD vectorization using either assembly or C language extensions (intrinsics), and also has to parallelize the algorithm and implement the communication using the MFC DMA capabilities through calls to appropriate library routines. This is the part that most programmers will find most difficult, since it requires at least some familiarity with the basic concepts of parallel programming and short vector SIMD programming, as well as some insight into the low-level hardware details of the chip.

A number of programming models/environments have emerged for programming the CELL processor. The CELL processor seems to ignite similar enthusiasm in the HPC/scientific community, the DSP/embedded community, and the GPU/graphics community. Owing to that, the world of programming techniques proposed for the CELL is as diverse as the communities involved, and includes shared-memory models, distributed memory models, and stream processing models, representing both data-parallel and task-parallel approaches [13–19].

Programming Environments

                                Origin                        Available  Free
CELL SuperScalar                Barcelona Supercomp. Center       X       X
Sequoia                         Stanford University               X       X
Accelerated Library Framework   IBM                               X       X
CorePy                          Indiana University                X       X
Multi-Core Framework            Mercury Computer Systems          X
Gedae                           Gedae                             X
RapidMind                       RapidMind                         X
Octopiler                       IBM
MPI Microtask                   IBM


A separate problem is that of programming a cluster of PlayStation 3s. A PlayStation 3 cluster is a distributed-memory machine, and practically there is no alternative to programming such a system with message-passing. Although MPI is not the only message-passing API in existence, it is the predominant one for numerical applications. A number of freely available implementations exist, the most popular being MPICH2 [20] from Argonne National Laboratory and Open MPI [21–23], an open-source project actively developed by a team of 19 organizations including universities, national laboratories, companies and private individuals. Due to the compliance of the PPE with the PowerPC architecture, any of these libraries can be compiled for execution on the CELL practically out-of-the-box. The good news is that using MPI on a PlayStation 3 cluster is not any different than on any other distributed-memory system. The bad news is that using MPI is not trivial, and proficiency takes experience. Fortunately, the MPI standard has been around for more than a decade, and a vast amount of literature is available in print and online, including manuals, reference guides and tutorials.

Overlapping communication and computation is often referred to as the Holy Grail of parallel computing. Although the MPI standard is crafted with such overlapping in mind, for a number of reasons it is often not accomplished on commodity computing clusters. It is, however, greatly facilitated by the hybrid nature of the CELL processor. Since the PPE is not meant for heavy number crunching, it issues computational tasks to the SPEs. While these tasks are in progress, the PPE is basically idle and can issue communication tasks in the meantime. In principle, the role of the PPE is to issue both computational and communication tasks in a non-blocking fashion and check their completion, which means that perfect overlapping can be achieved. Even if the message exchange involves the PPE to some extent, it should not disturb the SPEs proceeding with the calculations.

Scientific Computing

The PlayStation 3 has severe limitations for scientific computing. The astounding peak of 153.6 Gflop/s can only be achieved for compute-intensive tasks in single precision arithmetic, which, besides delivering less precision, is also not compliant with the IEEE floating point standard: only truncation rounding is implemented, denormalized numbers are flushed to zero, and NaNs are treated like normal numbers. At the same time, the double precision peak is less than 11 Gflop/s. Also, memory-bound problems are limited by the main memory bandwidth of 25.6 GB/s. It is a very respectable value even when compared to cutting-edge heavy-iron processors, but it sets the upper limit for memory-intensive single precision calculations at 12.8 Gflop/s and for double precision calculations at 6.4 Gflop/s, assuming two operations are performed on each data element.

However, the largest disproportion in performance of the PlayStation 3 is between the speed of the CELL processor and the speed of the Gigabit Ethernet interconnection. Gigabit Ethernet is not crafted for performance and, in practice, only about 65 percent of the peak bandwidth can be achieved for MPI communication. Also, due to the extra layer of indirection between the OS and the hardware (the hypervisor), the incurred latency is as big as 200 µs, which is at least an order of magnitude worse than today's standards for high performance interconnections. Even if the latency could be brought down and a larger fraction of the peak bandwidth could be achieved, the communication capacity of 1 Gb/s is way too small to keep the CELL processor busy. A common remedy for slow interconnections is running larger problems. Here, however, the small size of the main memory of 256 MB turns out to be the limiting factor. Altogether, even such simple examples of compute-intensive workloads as matrix multiplication cannot benefit from running in parallel on more than two PlayStation 3s [24].

Only extremely compute-intensive, embarrassingly parallel problems have a fair chance of success in scaling to PlayStation 3 clusters. Such distributed computing problems, often referred to as screen-saver computing, have gained popularity in recent years. The trend initiated by the SETI@Home project has had many followers, including the very successful Folding@Home project.

As far as a single CELL processor is concerned, there are a number of algorithms where the floating point shortcomings can be alleviated. Some of the best examples can be found in dense linear algebra, for fundamental problems such as solving dense systems of linear equations. If certain conditions are met, the bulk of the work can be performed in single precision, exploiting the single precision speed of the CELL. Double precision is only needed in the final stage of the algorithm, where the solution is corrected to double precision accuracy. Results are available for the general, non-symmetric case, where LU factorization is used [25], and for the symmetric positive definite case, where Cholesky factorization is used [26]. Performance above 100 Gflop/s has been reported for the latter case on a single PlayStation 3.

Mixed-Precision Algorithms

A_32, b_32 ← A, b
L_32, L^T_32 ← Cholesky(A_32)
x^(1)_32 ← back_solve(L_32, L^T_32, b_32)
x^(1) ← x^(1)_32
repeat
    r^(i) ← b − A x^(i)
    r^(i)_32 ← r^(i)
    z^(i)_32 ← back_solve(L_32, L^T_32, r^(i)_32)
    z^(i) ← z^(i)_32
    x^(i+1) ← x^(i) + z^(i)
until x^(i) is accurate enough

(the subscript 32 denotes single precision storage)

Cholesky(A_32) - the most computationally intensive single precision operation (complexity O(N^3))
b − A x^(i) - the most computationally intensive double precision operation (complexity O(N^2))

Mixed-precision algorithm for solving a system of linear equations using Cholesky factorization in single precision and iterative refinement of the solution in order to achieve full double precision accuracy.

[Figure: Gflop/s versus problem size (0 to 2000), comparing the double precision peak, the single precision algorithm (single precision results), and the mixed precision algorithm (double precision results).]

Performance of the mixed precision algorithm compared to the single precision algorithm and double precision peak on a Sony PlayStation 3.


In cases where single precision accuracy is acceptable, like signal processing, algorithms like the fast Fourier transform perform exceptionally well. Implementations of two different fixed-size transforms have been reported so far [27, 28], with one approaching the speed of 100 Gflop/s [28, 29].

Future

One of the major shortcomings of the current CELL processor for numerical applications is the relatively slow speed of double precision arithmetic. The next incarnation of the CELL processor is going to include a fully-pipelined double precision unit, which will deliver a speed of 12.8 Gflop/s from a single SPE clocked at 3.2 GHz, and 102.4 Gflop/s from an 8-SPE system, which is going to make the chip a very serious competitor in the world of scientific and engineering computing.

Although in agony, Moore's Law is still alive, and we are entering the era of billion-transistor processors. Given that, the current CELL processor employs a rather modest 234 million transistors. It is not hard to envision a CELL processor with more than one PPE and many more SPEs, perhaps exceeding the performance of one TeraFlop/s for a single chip.

Conclusions

The idea of many-core processors reaching hundreds, if not thousands, of processing elements per chip is emerging, and some voices speak of distributed memory systems on a chip, an inherently more scalable solution than shared memory setups. Owing to this, the technology delivered by the PlayStation 3 through its CELL processor provides a unique opportunity to gain experience, which is likely to be priceless in the near future.

References

[1] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), 1965.

[2] Excerpts from a conversation with Gordon Moore: Moore's Law. Intel Corporation, 2005.

[3] I. Foster. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison Wesley, 1995.

[4] E. A. Lee. The problem with threads. Computer, 39(5):33–42, 2006.

[5] M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. J. Dongarra. MPI: The Complete Reference. MIT Press, 1996.

[6] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, June 1995.

[7] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, July 1997.

[8] G. M. Amdahl. Validity of the single processor approach to achieving large-scale computing capabilities. In Proceedings of the AFIPS Conference, pages 483–485, 1967.

[9] J. L. Gustafson. Reevaluating Amdahl's law. Communications of the ACM, 31(5):532–533, 1988.

[10] IBM. Cell Broadband Engine Programming Handbook, Version 1.0, April 2006.

[11] A. Buttari, P. Luszczek, J. Kurzak, J. J. Dongarra, and G. Bosilca. SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3, Version 1.0. Innovative Computing Laboratory, Computer Science Department, University of Tennessee, May 2007.

[12] OpenMP Architecture Review Board. OpenMP Application Program Interface, Version 2.5, May 2005.


[13] P. Bellens, J. M. Perez, R. M. Badia, and J. Labarta. CellSs: a programming model for the Cell BE architecture. In Proceedings of the 2006 ACM/IEEE SC'06 Conference, 2006.

[14] K. Fatahalian et al. Sequoia: Programming the memory hierarchy. In Proceedings of the 2006 ACM/IEEE SC'06 Conference, 2006.

[15] IBM. Software Development Kit 2.1 Accelerated Library Framework Programmer's Guide and API Reference, Version 1.1, March 2007.

[16] M. Pepe. Multi-Core Framework (MCF), Version 0.4.4. Mercury Computer Systems, October 2006.

[17] M. McCool. Data-parallel programming on the Cell BE and the GPU using the RapidMind development platform. In GSPx Multicore Applications Conference, 2006.

[18] A. J. Eichenberger et al. Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture. IBM Sys. J., 45(1):59–84, 2006.

[19] M. Ohara, H. Inoue, Y. Sohda, H. Komatsu, and T. Nakatani. MPI microtask for programming the Cell Broadband Engine processor. IBM Sys. J., 45(1):85–102, 2006.

[20] Mathematics and Computer Science Division, Argonne National Laboratory. MPICH2 User's Guide, Version 1.0.5, December 2006.

[21] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings of the 11th European PVM/MPI Users' Group Meeting, pages 97–104, September 2004.

[22] R. L. Graham, T. S. Woodall, and J. M. Squyres. Open MPI: A flexible high performance MPI. In Proceedings of the 6th Annual International Conference on Parallel Processing and Applied Mathematics, September 2005.

[23] R. L. Graham, G. M. Shipman, B. W. Barrett, R. H. Castain, G. Bosilca, and A. Lumsdaine. Open MPI: A high-performance, heterogeneous MPI. In Proceedings of the Fifth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks, September 2006.

[24] A. Buttari, J. Kurzak, and J. J. Dongarra. LAPACK Working Note 185: Limitations of the PlayStation 3 for high performance cluster computing. Technical Report CS-07-597, Computer Science Department, University of Tennessee, 2007.

[25] J. Kurzak and J. J. Dongarra. Implementation of mixed precision in solving systems of linear equations on the CELL processor. Concurrency Computat.: Pract. Exper., 2007. In press, DOI: 10.1002/cpe.1164.

[26] J. Kurzak and J. J. Dongarra. LAPACK Working Note 184: Solving systems of linear equations on the CELL processor using Cholesky factorization. Technical Report CS-07-596, Computer Science Department, University of Tennessee, 2007.

[27] A. C. Chow, G. C. Fossum, and D. A. Brokenshire. A programming example: Large FFT on the Cell Broadband Engine, 2005.

[28] J. Greene and R. Cooper. A parallel 64K complex FFT algorithm for the IBM/Sony/Toshiba Cell Broadband Engine processor, September 2005.

[29] J. Greene, M. Pepe, and R. Cooper. A parallel 64K complex FFT algorithm for the IBM/Sony/Toshiba Cell Broadband Engine processor. In Summit on Software and Algorithms for the Cell Processor. Innovative Computing Laboratory, University of Tennessee, October 2006.
