WHITE PAPER

With Its New PowerXCell 8i Product Line, IBM Intends to Take Accelerated Processing into the HPC Mainstream

Sponsored by: IBM

Richard Walsh Steve Conway Earl C. Joseph, Ph.D. Jie Wu August 2008

IDC OPINION

Fifteen years ago, the high-performance computing (HPC) market started to abandon its data-parallel, vector architectural lineage and turned to commodity-priced scalar processors. One by one, the other custom components of HPC systems have been pushed aside in favor of cheaper, standards-based alternatives. With some notable exceptions, most HPC system component technologies have been mainstreamed, a change driven by the price-performance advantages offered by standards-based components engineered to serve volume markets. Nothing reflects this more strongly than the fact that standards-based cluster sales based on x86 microprocessors were responsible for over 65% of the revenue generated in the HPC market in 2007, up from just a 20% share in 2003.

However, blood is thicker than water, and the HPC user community has not forgotten where it came from or the fundamental data intensity of most HPC workloads. The mainstream x86 instruction set architecture (ISA) was not designed with HPC data-parallel requirements in mind and, as a result, has limited the sustained performance of many HPC applications. While processor clock speeds have recently stopped climbing, processor cores have multiplied, exacerbating this sustained performance shortcoming. There is little to suggest that HPC buyers will abandon the x86 mainstream and return to purchasing large numbers of custom data-parallel or vector systems to improve their applications' sustained performance. The aim of IBM's PowerXCell 8i product line, with its single instruction, multiple data (SIMD) ISA and memory flow controller (MFC), is instead to bring data-parallel computing back to HPC and deliver higher sustained performance and power efficiency to HPC workloads with a processing engine supported by volume economics.

In IDC's opinion, IBM's new PowerXCell 8i processor and its go-to-market strategy have the potential to stimulate the return and mainstreaming of data-parallel processing to HPC. The key features of IBM's new PowerXCell 8i product line and its market strategy include:

• A single-chip, MFC-controlled, high memory bandwidth, shared memory design

• A double data rate (DDR2) memory subsystem and fully IEEE-compliant double-precision floating-point capabilities

• A broad range of PowerXCell 8i-based products configurable at a variety of scales

• A multitiered programming model with strong support among IBM's customers and partners

• Its use in the world's fastest computer, Los Alamos National Laboratory's (LANL) petascale supercomputer, Roadrunner

• A well-defined road map supported by volume economics from the gaming industry

Global Headquarters: 5 Speen Street, Framingham, MA 01701 USA   P.508.872.8200   F.508.935.4015   www.idc.com

#213691 ©2008 IDC

IN THIS WHITE PAPER

In this white paper, IDC reviews the state of the HPC market, its recent five years of very strong growth, the rise of standards-based clusters, and the growing importance of blades and custom-engineered enclosures. HPC buyer "pain points" related to memory bandwidth shortages, parallel programming of multicore processors, and power consumption are discussed, as is their potential to stimulate the more mainstream use of accelerators and data-parallel programming. Finally, the paper reviews IBM's PowerXCell 8i product line, multitiered programming environment, and some of its parallel programming software partnerships.

SITUATION OVERVIEW

HPC's Strong Market Growth

The HPC market has shown rapid growth in the five years since 2002, especially when compared with the background rate of IT spending generally. HPC revenue had three years of double-digit growth between 2003 and 2005, followed by a still-impressive 9% year-over-year growth between 2005 and 2006. In 2007, despite a slowing economy, HPC revenue growth over 2006 was 15.5%, exceeding IDC estimates. Table 1 shows revenue growth over this period by competitive segment.

TABLE 1

Worldwide HPC Market Revenue by Competitive Segment, 2003-2007 ($M)

Competitive Segment      Price Range           2003    2004    2005    2006    2007   2003-2007 CAGR (%)
Supercomputer            >$500,000            2,401   2,631   2,881   2,567   2,983    5.6
Technical divisional     $250,000-499,999       544     977   1,197   1,420   1,781   34.5
Technical departmental   $100,000-249,999       947   1,117   2,561   3,323   4,193   45.1
Technical workgroup      $0-99,999            1,806   2,668   2,568   2,744   2,607    9.6
Total                                         5,698   7,393   9,208  10,055  11,563   19.4

Source: IDC, 2008
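
The CAGR column in Table 1 can be checked from the first and last years of each row. The following is a minimal sketch, assuming the standard compound annual growth rate formula with four yearly steps between 2003 and 2007; the function name `cagr` and the dictionary `table1` are illustrative names, not part of IDC's methodology.

```python
def cagr(first: float, last: float, periods: int) -> float:
    """Compound annual growth rate, in percent, over `periods` yearly steps."""
    return ((last / first) ** (1.0 / periods) - 1.0) * 100.0


# 2003 and 2007 revenue ($M) copied from Table 1.
table1 = {
    "Supercomputer": (2401, 2983),
    "Technical divisional": (544, 1781),
    "Technical departmental": (947, 4193),
    "Technical workgroup": (1806, 2607),
    "Total": (5698, 11563),
}

# Four yearly steps separate 2003 from 2007.
rates = {segment: cagr(start, end, 4) for segment, (start, end) in table1.items()}

for segment, rate in rates.items():
    print(f"{segment:<24}{rate:6.1f}%")
```

Under these assumptions the computed rates agree with the published column to within rounding. The same function applies to the 2008-2012 forecast, again with four yearly steps.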

Given the growing global interest in HPC technology as an essential component in national economic and technology strategies and the robust competition in the market, which continues to produce rapid innovation, IDC sees few major threats to a continued pattern of high growth in 2008 and beyond. In IDC's view, even a softening world economy should not greatly alter this forecast market growth because HPC's heavy R&D focus and longer buying cycles have largely insulated it historically from short-term economic downturns. IDC projects that HPC server revenue will increase at around a 9% CAGR through 2012 to reach almost $18 billion, up from under $6 billion in 2003 (see Table 2).

TABLE 2

Worldwide HPC Market Revenue Forecast by Competitive Segment, 2008-2012 ($M)

Competitive Segment      Price Range           2008    2009    2010    2011    2012   2008-2012 CAGR (%)
Supercomputer            >$500,000            3,035   3,247   3,463   3,682   3,905    6.5
Technical divisional     $250,000-499,999     2,102   2,427   2,755   3,086   3,420   12.9
Technical departmental   $100,000-249,999     4,801   5,400   5,990   6,570   7,140   10.4
Technical workgroup      $0-99,999            2,784   2,959   3,131   3,301   3,469    5.6
Total                                        12,723  14,033  15,339  16,639  17,934    9.0

Source: IDC, 2008

HPC Clusters Fuel Market Growth

IDC's data show that the surge in HPC revenue has been fueled primarily by purchases of x86-based, Linux cluster systems priced below $500,000 (especially those priced under $250,000). This growth was sustained by MPI, a maturing, message-based parallel programming model. HPC workloads with largely partitionable data structures, already parallelized for custom massively parallel processing (MPP) and constellation systems, could be moved easily to clusters. Once there, input data sets could be grown to match the memory and bandwidth provided on the additional cluster nodes and allow for further scaling (so-called weak scaling). Even less scalable workloads benefited because more jobs with distinct inputs could be run simultaneously, increasing throughput, increasing the research and development iteration rate, and reducing time to solution. This process pushed the HPC price-performance curve sharply downward, creating a zero-gravity sensation and the expectation that performance should more than double in a technological generation while costing no more. This price-performance advantage and the other advantages that HPC buyers associate with clusters are presented in Figure 1.
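
The distinction between scaling a fixed problem and growing the problem with the node count is often illustrated with two textbook models, Amdahl's law (fixed problem size, or strong scaling) and Gustafson's law (problem size grown with the machine, or weak scaling). The sketch below is not drawn from IDC's analysis; the 95% parallel fraction and the node counts are hypothetical values chosen only to show why growing the input data set lets a workload keep scaling on larger clusters.

```python
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Strong scaling: a fixed-size problem is spread over more nodes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / nodes)


def gustafson_speedup(parallel_fraction: float, nodes: int) -> float:
    """Weak scaling: the problem size grows in step with the node count."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * nodes


PARALLEL_FRACTION = 0.95  # hypothetical workload, not an IDC figure

for nodes in (4, 16, 64):
    strong = amdahl_speedup(PARALLEL_FRACTION, nodes)
    weak = gustafson_speedup(PARALLEL_FRACTION, nodes)
    print(f"{nodes:>3} nodes  fixed size: {strong:6.2f}x  scaled size: {weak:6.2f}x")
```

In this idealized model the fixed-size speedup flattens as nodes are added, while the scaled-size speedup keeps growing almost linearly. Real clusters fall short of both curves once communication and memory bandwidth limits are taken into account.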

FIGURE 1

Cluster Drivers: Top Reasons to Purchase HPC Clusters

[Bar chart not reproduced. Horizontal axis: number of responses (0 to 70). Reasons charted: Better price/performance; Greater system throughput; Ability to do new more/better science; Ability to run larger problems; Total cost of ownership (TCO); Improved capacity management; To improve competitiveness; Other.]

Source: IDC, 2008

As recently as early 2003, clusters accounted for just 20% of overall HPC server revenue. The dramatic penetration of the HPC market by clusters and their replacement of custom HPC systems through 4Q07 is shown in Figure 2. By the end of 2007, clusters had attained a 65% share of HPC server revenue. IDC sees clusters eventually topping out at about 80% of the HPC market, with the other 20% made up of systems that do not qualify as clusters, such as single-node servers, systems with symmetric multiprocessing (SMP) architectures, and MPP systems such as the IBM Blue Gene, Cray XT, and the SiCortex SC5832 that have too much custom content to fit the standards-based cluster definition.

FIGURE 2

Worldwide High-Performance Computing Revenue Share by Server Type, 1Q03-4Q07

[Chart not reproduced. Vertical axis: revenue share (%), 0 to 90. Horizontal axis: quarters from 1Q03 through 4Q07. Series: Cluster, Noncluster.]

Source: IDC, 2008

IDC sees cluster revenue growth and market penetration continuing and pushing down into entry-level systems. "Ease-of-everything" cluster offerings designed for the technical workgroup (systems selling for under $100,000) at smaller firms and in back-office locations are expected to show particularly strong growth in 2008 and beyond. However, the rapid acceptance of HPC cluster computing systems, which are built from separate compute nodes that use standard component technologies (x86 processors, commodity motherboards, standards-based networking technology, and primarily the Linux OS), will continue to cause disruptive changes in the HPC market. Such changes, challenges, new market requirements, and buyer "pain points" also define new market opportunities. The HPC market's growing interest in data-level parallelism (DLP) acceleration technology is just such an opportunity.

The Challenges of HPC Clusters, Buyer "Pain Points," and IBM's PowerXCell 8i Solution

It is perhaps stating the obvious that the overarching elements potentially missing from a cluster system assembled à la carte from commodity hardware and software components are integration and a balanced system design. Custom-built HPC systems are balanced to suit the HPC task and integrated to simplify its completion. Because of this, custom HPC systems have generally been able to achieve higher sustained performance on individual jobs and better overall utilization rates. As clusters have scaled out to very large node counts and scaled in to "fatter" nodes with much more processing power per rack unit, the intangibles of integration and balance have been deemphasized. The price-per-peak-performance and capital cost advantages of HPC clusters have, until recently, overwhelmed their operational drawbacks: system component imbalance and complexity, which limit sustained performance and lower overall system utilization rates. Lastly, the cluster revolution has placed cluster systems in many new environments, and their low cost has led to substantial growth in average node counts (a sixfold increase between 2004 and 2006 alone, according to IDC data). A consequence is that supplying basic operational inputs such as power, cooling, space, and support has become an important concern for HPC buyers.

Table 3 summarizes these and other HPC cluster buyer "pain points" and also indirectly presents the market requirements that support buyer interest in integrated, blade-based, DLP acceleration technology of the type that IBM now offers with its new PowerXCell 8i product line and its QS22 blade in particular.

TABLE 3

HPC Cluster Buyer/User "Pain Points"

"Pain Point" Category                                       "Pain Point" Particulars
Managing HPC cluster complexity                             System installation, monitoring, upgrades
                                                            System administration, middleware
                                                            User and application support
Power, cooling, and space requirements                      Cluster price-performance drives down capital costs but drives up operating costs and resource use
Multicore, multisocket, multinode issues                    Scheduling and programming complexity
                                                            Memory size and bandwidth inadequacy
                                                            Interconnect bandwidth, message rate mismatches
Server interconnect performance                             Latency, bandwidth, message rates, collectives performance
Storage system performance, data management                 Total storage, file size, file number
                                                            Bandwidth, IOPS, reliability
                                                            Data staging, archiving
Parallel application coding and scaling issues              Multicore, multisocket, multinode, accelerators
Third-party software costs                                  Licensing models
                                                            Limited parallel price-performance, scaling
Better reliability, availability, and serviceability (RAS)  Extremely large-scale systems require new approaches
                                                            New production and operational environments require new approaches
New buyer requirements                                      "Ease-of-everything" needs of new buyers

Source: IDC, 2008

While the QS22 (and IBM's other PowerXCell 8i-based products) is presented in more detail below, it is important to note how its basic features respond to some key challenges facing today's HPC cluster buyers and users.

Blade-Based Design

First among these is the QS22's compact, integrated blade-based design. IBM and other HPC vendors with strong engineering skills have addressed the dilemma of providing integrated solutions while still using standards-based components by engineering dense, form-factor blades and their companion integrated enclosures. Blade sales are growing as a percentage of overall cluster sales. Blades and their enclosures provide vendors with the scope to engineer in value. This reduces cluster operating expenses and complexity while allowing the continued use of standards-based components that exploit volume-driven price-performance curves. The QS22, like other blade systems, reduces cluster management complexity and lowers power, cooling, and space requirements.

Fully IEEE-Compliant Double Precision

The feature of the QS22's PowerXCell 8i processor that stands out most against competition from graphics processing unit (GPU) accelerators is its pipelined, fully IEEE-compliant, double-precision processing capability. The QS22 contains two tightly coupled PowerXCell 8i processors that provide 2 x (1 + 8) = 18 cores. Sixteen of these are DLP, SIMD processors. Vector and other data-parallel architectures are known to be both bandwidth and power efficient, and IBM has exploited this principle and engineered its new PowerXCell 8i double-precision processors with a surprisingly small transistor count.

Low Latency and High Bandwidth Memory Access

The PowerXCell 8i's MFC and on-chip DDR2 memory controller make its memory large in size, low in first byte latency, and high in bandwidth. PowerXCell 8i's MFC supports DMA and blocked or vector-like memory operations among all the cores and main memory. These features relieve cluster buyer pain in the categories of power and cooling (the QS22 delivers large numbers of FLOPS per watt), multicore and multisocket bandwidth inadequacy (both its SIMD instruction set and sustained per-processor bandwidth help here), and even in the area of parallel application scalability, where the multicore, multisocket QS22 allows more parallel work to be done per node.

Reliability and Cost Effectiveness

Other problem areas that cluster buyers face on certain applications, and that the QS22 could potentially address, include high application licensing costs and the need for improved reliability, availability, and serviceability (RAS). The more efficient parallel performance afforded by a DLP processor has the potential to reduce the number of application licenses required, and the highly integrated QS22 blade with 18 cores in a single form factor reduces the operating temperature per FLOP and the number of independent parts that could fail.

IBM's PowerXCell 8i and QS22 blade are not entirely HPC cluster "pain point" positive. As with many other acceleration technologies, the PowerXCell 8i and QS22 introduce an additional layer of programming complexity because the Power Processor Element (PPE) and the Synergistic Processor Element (SPE) instruction sets are not x86 based and the programming model is not single binary. This issue has not been ignored by IBM and is a focal point of its effort to mainstream PowerXCell 8i acceleration technology. IBM's PowerXCell 8i programming models, its Software Development Kit (SDK) for multicore acceleration, and its application development partnerships in both the government and commercial sectors are intended to address programmability and are considered in more detail below.

The Promise and Challenge of Accelerators

In the high-dimensional space (e.g., line width, clock speed, instruction set architecture, memory, and cache subsystem) that defines HPC processor microarchitecture, design themes have generally had a limited life span and alternatives have always persisted on the sidelines in service to particular application classes or special-purpose requirements (e.g., custom MPP and vector architectures). Changes in HPC market economics, technological breakthroughs, or barriers governing processor design can push such alternatives to the forefront and current approaches to the side. The HPC market's sharp change of course away from vector architectures to MPP systems in the mid-1990s is one example of this, and as presented earlier, the rapid replacement of these custom HPC MPP architectures by standards-based cluster systems is another.

As the era of clock-driven, superscalar, instruction-level parallelism (ILP) processor design has waned, the HPC market has entered another period of transition. Power dissipation considerations have forced chip designers to look at alternative forms of on-chip parallelism that provide performance acceleration without requiring so much power. Both thread-level parallelism (TLP) and DLP processor designs are being explored. They have been dubbed accelerators because in many cases they augment general-purpose performance from a separate bus or because they are simply not integral to the general-purpose processor instruction set. Today, the HPC market has multiple approaches to consider provided in multiple implementations. In addition to IBM's PowerXCell 8i processor, which is our focus here, the accelerator category includes FPGAs, GPUs, multicore and many-core processors, vector processors, many-threaded processors, and application-specific integrated circuits (ASICs).

While the variety of approaches in this category is large today, collectively they suggest an abstract or future architecture that includes many, probably simpler mixed-type processing elements, perhaps with field-programmable features (perhaps the on-chip interconnect, if not the cores themselves), and instructions that move streams (or vectors) of data onto the chip in a single issue. The common elements of these alternatives and the great incentive to unify and simplify the parallel programming model used to drive accelerator performance have stimulated investment in parallel programming software for accelerators at IBM for the PowerXCell 8i. This growth in investment and the potential future convergence of accelerator microarchitectures suggest a future of much improved price-, power-, and productivity-performance for HPC.

IDC has been examining the accelerator category through market surveys, market forecasts, and technology analyses. With this analysis as a backdrop, the promise and challenges of accelerators are reviewed here, as are the specific concerns of potential buyers. This is provided as context within which to consider IBM's new PowerXCell 8i hardware and software product offerings.

The Promise of Accelerators

Crucial among all the factors that support the future use of accelerators in the era of HPC clusters is that today most of the alternatives are backed by volume economics. Intel's and AMD's multicore and future many-core processors obviously are. IBM's PowerXCell 8i is an HPC-specific modification of IBM's first-generation Cell Broadband Engine (Cell/B.E.) processor designed for the computer gaming market and the Sony PlayStation. GPUs have similar volume market support from the gaming industry. FPGAs are supported by volume purchases in the embedded signal processing space. Of the alternatives listed earlier, only vector processors and kernel-specific ASICs are without current volume economic support. Accelerator technologies that meet HPC's volume economic price-performance requirements have the best chance for success.

Accelerators, both TLP and DLP designs, also offer the prospect of improved memory bandwidth use and higher sustained performance: the former by hiding load latency underneath processor-ready work in other threads, and the latter by parallel pipelining data streams from memory into the processor and back. Vector or DLP designs, such as the PowerXCell 8i, have a particular advantage for HPC workloads because of their natural data intensity. Another advantage that accelerators with heterocores or field-programmable cores offer over general-purpose processors is workload-specific functionality. They can be designed with only those functional units and/or the precision required by a particular class of HPC applications, or even that of an individual application kernel in the case of an ASIC.

Heterogeneous core chips, also called "chips with personality" (or programmability in the case of FPGAs), provide high use functionality and eliminate the general-purpose circuitry that consumes extra space and power. The scalar and vector processors that remain part of the Cray microarchitecture are perhaps the original examples of processors with personality. The heterogeneous design of the PowerXCell 8i is another example with a first-order division of labor and function (scalar and parallel) between the PPE and SPE cores on the chip. IDC expects that as line widths drop and as the number of cores per chip increases, the additional cores will offer an increasing variety of special-purpose functions.

Accelerators also have appeal because they can offer HPC datacenters efficiencies that deliver operational savings. Other things being equal, parallel systems, whether TLP or DLP, require less power to achieve the same level of performance and therefore run cooler and can be more densely packed. This allows fewer rackmounted units to provide the same performance using less power and leads to operational benefits in the current regime of scaled-out clusters.

Finally, the interest in acceleration technology in all its forms has stimulated community thinking about the parallel programming abstraction and promises more universal parallel programming language concepts and compilers that can produce code for the full variety of back-end parallel acceleration microarchitectures. IBM's investment in its SDK and its partnerships in the parallel software industry are significant efforts that take HPC in this direction. IBM also supports centers of expertise in academia (at the Barcelona Supercomputing Center, Georgia Tech, and the University of Maryland) to ensure that graduates in computer science and electrical engineering are exposed to current trends in computational science. The advantages presented earlier transfer in total to accelerators as a class, but only in part to each particular type of accelerator.

The Challenge of Accelerators

Substantial barriers remain to be overcome to mainstream acceleration technology, and Table 4 reminds the readers of these barriers. It also makes clear that while some challenges are general across the class, others apply only to specific types of accelerators. While all accelerators require extra programming effort to use, and IDC surveys place programming difficulty at the top of the list of accelerator challenges, FPGAs stand out as the most difficult to program, while single object vector processors are perhaps the easiest. IBM's PowerXCell 8i lands somewhere between. Most HPC workloads require or prefer double-precision floating point, but many of the alternatives today fall short in this category. FPGAs can be programmed with full IEEE 754 double-precision floating-point units, but these units consume large numbers of transistors, limiting the maximum performance per chip. Some GPU microarchitectures support the IEEE 754 double-precision format and meet some of its functional requirements; however, GPU vendors have avoided providing full double-precision capability because of its potential effect on performance. At this time, the Cray X2 vector processor and now the PowerXCell 8i heterogeneous multicore processor are the only fully IEEE 754 double-precision floating-point compatible HPC acceleration technologies available.

Continuing to work through Table 4, we note that those acceleration technologies designed as discrete components and that are accessed via an external bus must manage bandwidth limitations to the card and often have less memory than is available to the general-purpose processor on the motherboard. This is typically the case with GPUs and FPGAs. The PowerXCell 8i and custom vector processors both have the advantage of being able to address a unified, board-local memory space directly. Accelerators are typically less flexible than x86 architectures. GPUs can now handle more conditional data-parallel operations, but still have weaker integer performance. As noted earlier, FPGA floating-point capability is limited by the transistor count required to build these units.

Limited scalar processing power is often another issue. The scalar processors of both the Cray and IBM PowerXCell 8i cannot match the scalar performance of a fast x86 or POWER6 core. GPUs are known to consume a lot of power, although not necessarily per peak FLOP. The growth in use of blades limits the number of practical accelerator choices, as accelerator products have not yet generally accommodated the increased use of blades (IBM's PowerXCell 8i QS22 blade is an exception). With respect to volume-price requirements, custom ASICs and other custom accelerated processing technologies with compelling performance features still do not meet broad HPC market price requirements.

TABLE 4

Accelerator User "Pain Points"

"Pain Point" Category                 "Pain Point" Particulars
Programming difficulties              More difficult to program (especially FPGAs)
                                      Adds another parallel programming layer
                                      Requires dual object compiles
                                      Requires algorithmic adjustments
                                      Programming skill shortages
Insufficient precision, reliability   Single precision only, non-IEEE conformant
                                      No ECC in bus or memory
Continued bandwidth limitations       Performance limited by external bus speeds
                                      Card-local memory size limitations
                                      Adds a layer to memory hierarchy
                                      Poor instruction set support for memory operations
Inflexible architecture               Inability to handle loop conditionals or asymmetric TLP
                                      Lockstep parallelism/threads
                                      Poor scalar (or integer or floating-point) performance
Limited portability, high risk        Too many programming models for ISV support
                                      Investment in climbing the learning curve could be lost
Consume too much power                GPUs have high absolute power requirements
Wrong form factors                    Need blade-ready form factors
Too expensive                         Vector, ASIC, or too much custom content

Source: IDC, 2008

As noted earlier, the barriers to the widespread adoption of accelerator technology are significant. Some are generic to the entire category such as programming difficulty, and others are specific to individual accelerator types. The number of alternatives available is good news for the HPC market and gives buyers with specific needs choices. Many members of the HPC community are optimistic about accelerators in the longer term. One-third of those surveyed by IDC expected that accelerators would be very useful within a two- to three-year time frame, and another third believed that they would be at least somewhat useful. To quote one individual directly:

These barriers are largely removable. The issue is the business case. Improvements will be gated by the providers' view of the size of the market opportunity and the rate at which providers of commodity microprocessors improve their product's performance for HPC workloads.

IDC expects that as milestones on the various accelerator road maps are reached (as they have been recently with the PowerXCell 8i processor from IBM), these barriers will be lowered.

The PowerXCell 8i Lowers Accelerator Barriers

Walking backwards through the list of accelerator pain points, we can evaluate the PowerXCell 8i's features with respect to each. IBM and its partners are offering the PowerXCell 8i in a greater variety of forms and at several more price points than its predecessor. Some have been designed and priced to compete with GPU accelerator card offerings. The IBM QS22 blade improves on the QS21 blade in that it contains dual PowerXCell 8i processors and is among the first acceleration technologies available in dense blade form. The QS22 and the composite Triblade (which includes the QS22 blade) in LANL's Roadrunner system are somewhat more power efficient than their QS21 predecessor, both in an absolute sense and on a per double-precision MFLOPS basis, and compare well with the competition.

The PowerXCell 8i shares some of the programming difficulty, investment risk, and portability issues of other accelerator technologies. IBM's investments in the PowerXCell 8i programming environment to further reduce this barrier have continued since the release of the original Cell/B.E. With respect to flexibility, the PowerXCell 8i has some advantages. It offers fast integer and single- and double-precision floating-point performance, and the relative independence of its SPEs gives the PowerXCell 8i the ability to handle data-parallel conditionals as independent threads. As noted, the PowerXCell 8i adds high-speed double precision to the single-precision speed of its predecessor. Both are IEEE 754 format compatible, although single-precision operations are not fully compliant with every element of the standard. All memory and buses on the PowerXCell 8i include ECC to meet HPC reliability standards, which is not the case with some accelerator alternatives.

Finally, the PowerXCell 8i's memory bandwidth, type, and size improvements make it much improved for HPC workloads over the original Cell/B.E. Its DDR2 memory is potentially large and directly addressable from the chip, avoiding some of the memory-related issues of bus-based accelerators. Its MFC unit extends PowerXCell 8i's data-parallel design out to memory with its DMA and blocked memory reference capabilities. All in all, the incremental improvements of the QS22 and PowerXCell 8i validate the optimism expressed by the HPC user in the preceding quote on the prospects for accelerators in HPC. While hurdles remain to be overcome before accelerators are fully integrated in the HPC mainstream, much has been done to make the QS22 and PowerXCell 8i more HPC friendly.

IBM's New HPC Acceleration Products: The PowerXCell 8i Processor, the PowerXCell 8i PXCAB Card, and IBM's BladeCenter QS22

With the release of its PowerXCell 8i processor (65nm, SOI) and associated blades, accelerator cards, and systems, IBM offers the HPC market a range of third-generation PowerXCell 8i-based products, all with features that should significantly expand Cell/B.E.'s breadth of applicability in HPC and elsewhere. Important HPC-related improvements to its microarchitecture, additional form factors and features, improvements to its software development kit, and additional system offerings contribute to the PowerXCell 8i's expanded potential in HPC. This development at IBM is part of a broader pattern of change in the HPC market that has DLP acceleration technology (both hardware and software), supported by volume economics, potentially finding a place in the HPC mainstream.

New PowerXCell 8i Processor Retooled for HPC

While the PowerXCell 8i's lineage is clearly derived from the original graphics-oriented Cell/B.E. processor, its microarchitectural differences make it a new, HPC-specific branch off of that original Cell/B.E. line, still supported by the volume economics of Sony PlayStation sales but tactically augmented for HPC. Like its predecessor, the PowerXCell 8i has one PPE and eight SIMD stream SPEs, giving the chip nine processors in all (see Figure 3). IBM's road map indicates that a PowerXCell 8i follow-on is planned for the 2010 time frame that will double the number of PPEs and quadruple its SPEs to 32 in a 45nm SOI process.

F I G U R E 3

I B M ' s T h i r d - G e n e r a t i o n P o w e r X C e l l 8 i H e t e r o g e n e o u s M u l t i c o r e P r o c e s s o r

Source: IBM, 2008


First among the several important HPC-specific features designed into the new PowerXCell 8i is its enhanced double-precision (eDP) capability and performance. The double-precision units on earlier generation Cell/B.E. SPEs were not fully pipelined. On the PowerXCell 8i they are, and therefore each 3.2 GHz SPE delivers double-precision floating-point results seven times faster (one result per cycle) than its predecessor at a rate of 12.8 GFLOPS (3.2 GHz x 2 64-bit floating-point words x 2 64-bit floating-point operations [fused multiply-add]). This gives the eight SPEs per chip a combined double-precision peak performance of 102.4 GFLOPS or exactly one-half the chip's single-precision performance (~204.8 GFLOPS), because twice as many 32-bit, single-precision words (four versus two) fit in the SPE's 128-bit floating-point registers. IDC expects IBM to focus on the potential advantage in sustained performance per watt the PowerXCell 8i may have due to its single-chip architecture and unified, MFC-supported memory space. It is worth noting that this increase in double-precision performance comes without a substantial increase in transistor count, chip size, or thermal design power (TDP), which is listed at 92 watts for PowerXCell 8i.
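The peak figures quoted above follow from a short calculation. The sketch below reproduces them in Python using only numbers stated in the text; the constant and function names are illustrative and are not part of any IBM tool.

```python
# Peak floating-point arithmetic for the SPEs, using only figures quoted in the text.
CLOCK_GHZ = 3.2        # SPE clock rate
REGISTER_BITS = 128    # width of an SPE floating-point (SIMD) register
FMA_OPS = 2            # a fused multiply-add counts as two floating-point operations
SPES_PER_CHIP = 8


def spe_peak_gflops(word_bits: int) -> float:
    """Peak GFLOPS of one SPE completing one SIMD fused multiply-add per cycle."""
    words_per_register = REGISTER_BITS // word_bits  # 2 doubles or 4 singles
    return CLOCK_GHZ * words_per_register * FMA_OPS


def chip_peak_gflops(word_bits: int) -> float:
    """Combined peak GFLOPS of the eight SPEs on one chip (PPE not included)."""
    return SPES_PER_CHIP * spe_peak_gflops(word_bits)


if __name__ == "__main__":
    print(f"double precision, per SPE : {spe_peak_gflops(64):.1f} GFLOPS")
    print(f"double precision, per chip: {chip_peak_gflops(64):.1f} GFLOPS")
    print(f"single precision, per chip: {chip_peak_gflops(32):.1f} GFLOPS")
```

The factor of two between single and double precision comes entirely from the number of words that fit in a 128-bit register.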

Like the double-precision functional units in the first- and second-generation Cell/B.E. processors, the new double-precision functional units are fully IEEE 754 compatible in both format and function. The PowerXCell 8i's high-speed, single-precision floating-point units (designed more for graphics than for HPC applications) remain less than fully IEEE 754 floating-point compliant in function. Fully compliant single-precision results can be generated by truncating double-precision runs, but these complete at double-precision rates, which are half the native "graphics" single-precision rates. Double-precision floating-point capability is now also available from other acceleration technologies, but typically without full IEEE compliance. This feature of the PowerXCell 8i is one of several that distinguish it from other accelerators.

Equal in importance to the PowerXCell 8i's eDP capability is its redesigned on-chip memory controller, which addresses a larger, more standard DDR2-based memory subsystem. The previous-generation Cell/B.E. processor is based on a Rambus XDR memory architecture, which is bandwidth rich but limited in per-board memory capacity to values that are substantially lower than typical HPC applications require. The PowerXCell 8i is designed to preserve the memory bandwidth of the older Cell/B.E. (25.6 GBps per chip, or 0.25 bytes per double-precision FLOP) while offering greater memory capacity. A consequence is that the dual 128-bit (plus parity) memory buses of the new DDR2 memory controller increase the pin count of the PowerXCell 8i processor package, making it pin incompatible with older-generation Cell/B.E. processors.

The result is that PowerXCell 8i supports four DIMM slots and up to 16 GB of memory (more with future higher-density DIMMs) compared with the Cell/B.E.'s maximum of 1 GB. In addition, the PowerXCell 8i memory and memory bus subsystems are fully error corrected. Most of the remaining features of the PowerXCell 8i microarchitecture match those of the earlier Cell/B.E. version of the chip, but we remind the reader of the 256 KB local store associated with each SPE. This is a DMA-enabled, memory-mapped local memory with none of the transistor-demanding features of a full-blown cache. With the help of each SPE's MFC, the SPEs can asynchronously pipeline data between the local store and main memory or other SPEs' local stores. The local store's size, 16- and 128-byte blocked loads, and large outstanding memory reference queue are key features in the PowerXCell 8i's bandwidth profile.
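The asynchronous pipelining described above is commonly exploited through double buffering: while the SPE computes on one local buffer, the next block is already in flight. The following is a minimal Python sketch of that pattern only. A background thread stands in for the MFC's DMA engine, the 16 KB buffer size is a hypothetical choice made for this illustration, and none of the names correspond to IBM's SDK.

```python
from concurrent.futures import ThreadPoolExecutor

LOCAL_STORE_BYTES = 256 * 1024  # per-SPE local store size quoted in the text
BUFFER_BYTES = 16 * 1024        # hypothetical per-buffer budget for this sketch
DOUBLE_BYTES = 8
CHUNK_ELEMS = BUFFER_BYTES // DOUBLE_BYTES


def fetch(main_memory, start):
    """Stand-in for a DMA 'get': copy one chunk of main memory into a local buffer."""
    return list(main_memory[start:start + CHUNK_ELEMS])


def double_buffered_sum_of_squares(main_memory):
    """Compute on chunk i while the fetch of chunk i+1 is already in flight."""
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as dma:  # one simulated DMA engine
        pending = None
        for start in range(0, len(main_memory), CHUNK_ELEMS):
            in_flight = dma.submit(fetch, main_memory, start)  # start the next transfer
            if pending is not None:
                chunk = pending.result()                       # previous transfer is done
                total += sum(x * x for x in chunk)             # compute overlaps the fetch
            pending = in_flight
        if pending is not None:                                # drain the last buffer
            total += sum(x * x for x in pending.result())
    return total


if __name__ == "__main__":
    data = [float(i) for i in range(5000)]
    print(double_buffered_sum_of_squares(data))
```

Two buffers of this size occupy 32 KB, which leaves most of the 256 KB local store for code and other data.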


PowerXCell 8i in a PCIe Card Form Factor, IBM's PXCAB Card

Positioned and priced to compete with GPUs offered in standard PCIe form factors, IBM's PXCAB card is a double-wide, PCIe 16x card offered with custom packaging and labeling to OEMs for use in rackmounted units that might also accept GPU accelerator cards from NVIDIA or ATI. The PXCAB card includes one PowerXCell 8i processor, up to 8 GB of DDR2 memory on card, and two 1 Gigabit Ethernet ports. It functions more as a standalone component than a typical GPU accelerator. It runs the Linux operating system and communicates with the board's general-purpose processor via the PCIe bus using Ethernet emulation. This compact card retains the same advantages as IBM's other PowerXCell 8i products, including a large directly addressable memory that is error corrected, good double-precision performance per watt, and support for the components in IBM's SDK.

IBM's QS22 Brings PowerXCell 8i Performance to the Cluster

IBM's BladeCenter QS22 uses the same form factor as the older QS21 and the other blade-based offerings from IBM (see Figure 4). IBM's BladeCenter H chassis accepts 14 of the QS22 blades (or QS21 or other IBM blades), and sites with QS21 blades can add or upgrade to the QS22. The QS22 is a full-height blade and includes two 3.2 GHz PowerXCell 8i processors coherently connected with IBM's BIF interface; up to 16 GB of DDR2 memory per processor; two BladeCenter, midplane-facilitated Gigabit Ethernet ports; room for an InfiniBand adapter, a SAS adapter, and I/O buffer memory; and support for IBM's SDK.

Peak single-precision performance per blade is 460 GFLOPS (2 x [PPE+SPE]) and double-precision performance per blade is 217 GFLOPS (again, 2 x [PPE+SPE]). This works out to 3.04 TFLOPS per chassis or 12.16 TFLOPS per rack for double-precision and 6.44 TFLOPS per chassis or 25.76 TFLOPS per rack for single precision. Linpack performance per QS22 blade has been measured at around 170 double-precision GFLOPS, which is about 80% of peak performance per blade.
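The per-chassis and per-rack figures above can be checked with a few lines of arithmetic. The sketch below uses the blade numbers from the text; the value of four chassis per rack is an assumption inferred from the rack figures being four times the chassis figures, and the names are illustrative.

```python
BLADES_PER_CHASSIS = 14  # QS22 blades per BladeCenter H chassis (from the text)
CHASSIS_PER_RACK = 4     # assumption: inferred from the quoted rack and chassis figures

BLADE_PEAK_GFLOPS = {"double": 217.0, "single": 460.0}  # per-blade peaks from the text
BLADE_LINPACK_GFLOPS = 170.0                            # measured Linpack per blade


def chassis_peak_tflops(precision: str) -> float:
    """Peak TFLOPS of a fully populated chassis."""
    return BLADE_PEAK_GFLOPS[precision] * BLADES_PER_CHASSIS / 1000.0


def rack_peak_tflops(precision: str) -> float:
    """Peak TFLOPS of a rack, under the four-chassis assumption."""
    return chassis_peak_tflops(precision) * CHASSIS_PER_RACK


def linpack_efficiency() -> float:
    """Measured Linpack performance as a fraction of double-precision peak."""
    return BLADE_LINPACK_GFLOPS / BLADE_PEAK_GFLOPS["double"]


if __name__ == "__main__":
    for precision in ("double", "single"):
        print(precision,
              f"chassis={chassis_peak_tflops(precision):.3f} TFLOPS",
              f"rack={rack_peak_tflops(precision):.3f} TFLOPS")
    print(f"Linpack efficiency: {linpack_efficiency():.1%}")
```

Computed without intermediate rounding, the double-precision rack value is 12.152 TFLOPS; the 12.16 quoted above corresponds to four times the rounded chassis figure of 3.04. The Linpack ratio works out to roughly 78%, which the text rounds to about 80%.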


F I G U R E 4

I B M ' s Q S 2 2 B l a d e

Source: IBM, 2008

An examination of the QS22's power efficiency shows that a single blade consumes about 250 watts while running Linpack. A complete QS22 cluster running Linpack has been measured at 488 MFLOPS per watt. This heterogeneous multicore, data-parallel SIMD processor has very good MFLOPS-per-watt specs when compared with most general-purpose microprocessors, which typically have measured values under 300 MFLOPS per watt. GPU power efficiency is generally quoted with respect to the power consumed only by the card, and GPUs come out somewhat ahead of the QS22 when this is done; however, when the power consumed by the board supporting the GPU card is included, the results are much closer to equal. The deciding factor for efficiency for a particular application will be the sustained performance observed. IBM believes that the QS22's directly addressable memory with 2 x 25.6 GBps bandwidth, MFC-supported DMA engines, full IEEE compatibility, and coherent interchip interface will give it a double-precision, sustained-performance advantage over its competitors.

Like the QS21's processors, the QS22's PowerXCell 8i processors function as standalone, multicore processors, two to a board and coherently linked in a manner not dissimilar to a dual-socket Opteron board linked by HyperTransport. The Linux OS runs independently on the PPE core of each processor and manages the use of its eight SPEs. In this sense, a BladeCenter H enclosure fitted with QS22 blades is not a bus-accelerated cluster like those that add GPUs to a standard x86-based cluster system, but a cluster of tightly coupled heterogeneous, multicore, cc-NUMA PowerXCell 8i-based nodes.


For scalar work, the performance of PowerXCell 8i's PPE core does not equal the performance of the latest Intel or AMD x86 scalar cores. Yet, the QS22's tightly coupled architecture and large mixed-core count promise better sustained performance than bus-accelerated cluster systems on certain HPC applications. While the QS22 offers a cc-NUMA, dual-socket architecture with 18 cores, the Triblade in the Roadrunner system IBM built for LANL has a bus-based design similar to that of GPU-based accelerators.

R o a d r u n n e r , I B M ' s H P C H y b r i d S y s t e m f o r L A N L : A M i l e s t o n e i n D e s i g n a n d P e r f o r m a n c e

The announcement by IBM on June 10, 2008, that the PowerXCell 8i-based supercomputer (LANL's Roadrunner) it had assembled at its Poughkeepsie, New York, facility had become the first computer to run the industry's standard Linpack benchmark at a sustained petaflop was an HPC milestone. While newswire attention has focused on reaching the petaflop goal (a quadrillion double-precision floating-point operations per second), from IDC's perspective, the milestone is really defined by several other important features of this achievement.

The Meaning of the Petaflop Milestone

The first is that a system based on components that are standards based and largely volume priced is now at the top of HPC's TOP500 list for the first time. The components include AMD Opteron dual-core processors, 4x DDR 20 Gbps InfiniBand interconnect, DDR2 memory, and the PowerXCell 8i heterogeneous, multicore, DLP acceleration engine. Quibbling about whether the PowerXCell 8i is standards based is acceptable (its Triblade is a custom enclosure), but its presence on the scene is clearly driven by volume economic trends and early investment by Sony, Toshiba, and IBM in a processing engine designed not for HPC, but game boxes, in this case the Sony PlayStation. As one might expect of such a high-end system, its standards-based components are custom integrated, but architecturally, it is an InfiniBand switched cluster with acceleration technology supported by volume economics.

The acceleration technology is the second important feature of the announcement. The fastest computer in the world is now accelerator based, and the acceleration technology has not just augmented the performance of its general-purpose microprocessors. It is the primary engine behind Roadrunner's sustained Linpack petaflop. The system's PowerXCell 8i processors offer 1,332 TFLOPS compared with only about 50 TFLOPS from the dual-core Opterons. It is also noteworthy that acceleration technology did not merely put the system into the top 10 or 20 places of the TOP500 list, but rather put it at the very top.

Finally, LANL's Roadrunner and IBM's other PowerXCell 8i-based products bring HPC and its highest-performing system back to its data-parallel roots. Linpack, a benchmark with significant cache-reuse potential, runs at 78% efficiency on Roadrunner, which has no L2 cache and only a modestly sized, user-programmed, 256 KB local memory. This is a reminder of how well vector and data-parallel microarchitectures suit typically data-intensive HPC workloads (and also perhaps that the Cray-2 had similarly sized local memory). PowerXCell 8i's simplified, in-order data-parallel design also offers the added benefit of a reduced transistor count and therefore lower power consumption per FLOP. The PowerXCell 8i has only 250 million transistors on its 65nm die. Intel's quad-core Harpertown has 410 million; AMD's quad-core Barcelona has 463 million; and NVIDIA's Tesla GPU has 681 million. On the Linpack benchmark, Roadrunner achieves about 437 MFLOPS per watt even while carrying the power consumed by the AMD Opteron part of the system's Triblade (as we saw earlier, the QS22 blade is still more power efficient at 488 MFLOPS per watt). This makes Roadrunner over 65 MFLOPS per watt more efficient than even IBM's Blue Gene/P system, which was at the top of the Green500 list in February 2008, and almost 200 MFLOPS per watt more efficient than the best unaccelerated, x86-based cluster system.
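As a rough consistency check, the efficiency figure can be combined with the 2.3 MW power draw this paper quotes for Roadrunner to recover the sustained Linpack rate. The sketch below is illustrative arithmetic only, and the function name is hypothetical.

```python
def sustained_tflops(mflops_per_watt: float, megawatts: float) -> float:
    """Sustained TFLOPS implied by an efficiency figure and a total power draw."""
    watts = megawatts * 1e6
    total_mflops = mflops_per_watt * watts
    return total_mflops / 1e6  # 1 TFLOPS = 1e6 MFLOPS


if __name__ == "__main__":
    # 437 MFLOPS per watt at 2.3 MW, both taken from the text
    print(f"{sustained_tflops(437, 2.3):.1f} TFLOPS")
```

The result is just over 1,000 TFLOPS, in line with the sustained petaflop reported for the Linpack run.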

When taken together, the features of Roadrunner discussed here and of IBM's other PowerXCell 8i-based products send a powerful message:

HPC clusters designed around standards-based components, but in custom enclosures that use DLP acceleration technology augmented by blocked memory reference hardware (the PowerXCell 8i's MFC), can provide both industry-leading double-precision performance and power efficiency.

IDC expects that the HPC community is paying close attention to this message.

Roadrunner's Design Elements

Unlike the QS22 cluster mentioned earlier, in LANL's Roadrunner, IBM features a bus-based architecture that places the PowerXCell 8i blade under the control of a dual-socket Opteron node and that is accessible through a HyperTransport-to-PCI Express (HT-to-PCIe) bridge bus. This design is the basis for IBM's Triblade, which fits three to a standard IBM BladeCenter H chassis. The four-slot Triblade (see Figure 5) is currently available only as part of IBM's QS22/LS21-based Roadrunner system at LANL. It includes two QS22 accelerator blades, each with dual-socket, 3.2 GHz PowerXCell 8i boards and four slots for their own directly controlled, DDR2 board-local memory. The QS22 blades are connected to a single dual-socket, dual-core 1.8 GHz Opteron-based master node-blade called the LS21. They are connected through an HT-to-PCIe bridge-blade, which is sandwiched between them and gives the Triblade its quad-blade appearance. The two 2 x 8x PCIe to 16x HT links provide 2 + 2 GB per second of bandwidth per QS22 PowerXCell 8i socket to the LS21. Both the LS21 and QS22 have four DIMM slots per socket. Currently, Roadrunner is configured with 8 GB of memory per Opteron socket and 4 GB per PowerXCell 8i socket for (8 x 2) + (4 x 4) = 32 GB of memory in total per Triblade and 80 TB for the entire system. Memory within each board (LS21 and QS22) is cc-NUMA integrated.
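The Triblade composition described above can be written down as a small data model, which makes the per-Triblade memory arithmetic explicit. This is a sketch under the configuration stated in the text; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blade:
    name: str
    sockets: int
    cores_per_socket: int
    gb_per_socket: int  # memory configured per socket on Roadrunner, per the text

    @property
    def memory_gb(self) -> int:
        return self.sockets * self.gb_per_socket

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket


# One LS21 master blade and two QS22 accelerator blades form a Triblade
# (the HT-to-PCIe bridge blade carries no processors or memory in this model).
LS21 = Blade("LS21 (Opteron)", sockets=2, cores_per_socket=2, gb_per_socket=8)
QS22 = Blade("QS22 (PowerXCell 8i)", sockets=2, cores_per_socket=9, gb_per_socket=4)
TRIBLADE = (LS21, QS22, QS22)


def triblade_memory_gb(blades=TRIBLADE) -> int:
    """Total memory per Triblade: (8 x 2) + (4 x 4) GB."""
    return sum(blade.memory_gb for blade in blades)


def cell_sockets(blades=TRIBLADE) -> int:
    """Number of PowerXCell 8i sockets in a Triblade."""
    return sum(blade.sockets for blade in blades if blade is QS22)


if __name__ == "__main__":
    print(triblade_memory_gb(), "GB per Triblade")
    print(LS21.cores, "Opteron cores for", cell_sockets(), "PowerXCell 8i sockets")
```

The model also shows that each Triblade pairs four Opteron cores with four PowerXCell 8i sockets.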

From a programming perspective, the dual-processor Opteron LS21 functions as the programmable node for MPI message-passing, while the two QS22 boards are programmed at a lower level using one of the components of IBM's SDK.


F I G U R E 5

R o a d r u n n e r ' s C u s t o m I n t e g r a t e d T r i b l a d e

Source: IBM, 2008

Stepping back and looking at the larger design features, we see that LANL's Roadrunner combines 180 of these Triblade nodes and 12 I/O blades with a 288-port DDR InfiniBand switch into "connected units" (CUs). There are 17 CUs in the entire system, giving it 3,060 Triblade nodes for computation; a total of 6,120 dual-core Opteron chips (50 TFLOPS peak); and 12,240 PowerXCell 8i chips (1.33 PFLOPS peak). There is one Opteron core for each PowerXCell 8i chip. When its I/O and management nodes are included, Roadrunner contains 130,464 computational cores.
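The system-level counts above follow from the per-CU and per-Triblade figures. The sketch below recomputes them in Python; the quoted totals are taken from the text, and all names are illustrative.

```python
CUS = 17
TRIBLADES_PER_CU = 180
OPTERON_CHIPS_PER_TRIBLADE = 2
CORES_PER_OPTERON_CHIP = 2
CELL_CHIPS_PER_TRIBLADE = 4
CORES_PER_CELL_CHIP = 9            # one PPE plus eight SPEs

TOTAL_CORES_QUOTED = 130_464       # includes I/O and management nodes, per the text
CELL_PEAK_TFLOPS_QUOTED = 1332.0
OPTERON_PEAK_TFLOPS_QUOTED = 50.0


def roadrunner_counts() -> dict:
    """Derive chip and core counts for the compute (Triblade) nodes."""
    triblades = CUS * TRIBLADES_PER_CU
    opteron_chips = triblades * OPTERON_CHIPS_PER_TRIBLADE
    cell_chips = triblades * CELL_CHIPS_PER_TRIBLADE
    opteron_cores = opteron_chips * CORES_PER_OPTERON_CHIP
    cell_cores = cell_chips * CORES_PER_CELL_CHIP
    compute_cores = opteron_cores + cell_cores
    return {
        "triblades": triblades,
        "opteron_chips": opteron_chips,
        "cell_chips": cell_chips,
        "opteron_cores": opteron_cores,
        "compute_cores": compute_cores,
        # cores the quoted total attributes to nodes other than the Triblades
        "other_cores": TOTAL_CORES_QUOTED - compute_cores,
        "cell_share_of_peak": CELL_PEAK_TFLOPS_QUOTED
        / (CELL_PEAK_TFLOPS_QUOTED + OPTERON_PEAK_TFLOPS_QUOTED),
        "gflops_per_cell_chip": CELL_PEAK_TFLOPS_QUOTED * 1000.0 / cell_chips,
    }


if __name__ == "__main__":
    for key, value in roadrunner_counts().items():
        print(f"{key:22s} {value}")
```

The Triblades account for 122,400 cores, the PowerXCell 8i chips supply roughly 96% of the combined peak, and the per-chip peak of about 108.8 GFLOPS would be consistent with the 102.4 GFLOPS from the eight SPEs plus a smaller PPE contribution.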

Roadrunner uses a two-tier, fat tree topology supported by standard 288-port, 20 Gbps DDR InfiniBand switches and network adapters. There is full bisection bandwidth within each CU, and half bisection bandwidth among the CUs. All of Roadrunner's interconnect cables are optical, and its bisection bandwidth is uniformly 3.5 TBps. Its 216 I/O nodes support an aggregate bandwidth of 432 GB per second to LANL's 2-plus petabyte high-performance global file system from Panasas. A schematic diagram of Roadrunner's two-tiered DDR InfiniBand-based fat tree and its interconnected CUs is presented in Figure 6.


F I G U R E 6

R o a d r u n n e r ' s T w o - T i e r e d D D R I n f i n i B a n d F a t T r e e

[Diagram labels: 17 connected units (CUs), each with 180 compute nodes with PCIe-attached Cell blades, 12 I/O nodes, and a 288-port IB 4x DDR switch; 12 links per CU to each of eight second-stage 288-port IB 4x DDR switches. System totals: 3,060 compute nodes; 12,240 Cell eDP chips (1.3 PF, 52 TB); 6,120 dual-core Opterons (50 TF, 28 TB); 296 racks; 3.9 MW.]

Source: LANL, 2008

While our attention naturally turns to the details of Roadrunner's design as presented earlier and its sheer scale (it consumes 2.3 MWatts of power; has 130,000 cores; weighs 500,000 pounds; and will take 21 trucks to deliver), the applications that will be run on it and the new science they make possible should be our focus. As noted earlier, HPC users feel that the primary barrier to generating new science on accelerators, including IBM's PowerXCell 8i product line, will be their programmability. IBM and Los Alamos National Laboratory, RapidMind, Gedae, and others are investing heavily in the development of PowerXCell 8i's programming environment.

I n v e s t i n g i n P o w e r X C e l l 8 i P r o g r a m m a b i l i t y

The major challenge in improving the performance, productivity, and portability (the three Ps of HPC programming) of today's HPC applications is the ubiquity, variety, and growth of parallelism in HPC system architectures. At the high end, government labs are adapting or rewriting their key applications to take maximum advantage of ultraparallel HPC systems that now contain as many as 100,000 independent computational cores. At the low end, smaller businesses and corporate departments engaged in computational science and engineering can no longer count on clock-period performance improvements and must adopt and improve the parallel performance of their applications on multicore, multisocket servers and clusters to achieve the productivity that will keep them competitive.

Investment by government, business, and venture capital firms in technologies to improve the three Ps of HPC application programming has grown to respond to this challenge. Latency and bandwidth limitations, the multitiered memory hierarchy, synchronization bottlenecks, load balancing challenges, and recovery from failure are among the many factors that make today's ultraparallel programming problem a very difficult one even when every processing element runs the same instruction set. With all their potential benefits, the heterogeneous or hybrid multi-instruction set computing models that come along with most HPC acceleration technologies make this problem only more challenging.

As a heterogeneous chip multiprocessor (CMP), IBM's QS22 PowerXCell 8i hybrid architecture is positioned between the bus- or network-divided, dual-ISA approach of GPU acceleration and the vector-scalar, functional unit-integrated, single-ISA approach of Cray vector accelerators (and probably future designs from Intel and AMD). PowerXCell 8i's designers adopted the view that accelerators (and their instruction sets) should be allowed to evolve independently from their supporting general-purpose scalar cores but should be placed on the same chip and tightly coupled through a high-bandwidth interconnect and memory management system. This stems from IBM's long-term view that data-parallel acceleration is just an initial phase in a process in which accelerator cores will become more workload specific. PowerXCell 8i's parallel model is based on SIMD threads that are spawned from the general-purpose PPE onto the SPEs. These threads are supported with data delivered asynchronously from each SPE's DMA-enabled MFC units capable of as many as 128 outstanding 128-byte blocked, simultaneous memory references. IBM's on-chip division of labor separates the requirement of compiling and running threads for the distinct, general-purpose core from that for the accelerated cores while offering very high-bandwidth intercore communication and a shared memory space to connect them. As implemented, this approach has several advantages:

• This thread-based model is readily supported in the Linux kernel, and it gives IBM the flexibility to present a software-integrated programming environment while developing its acceleration and general-purpose hardware independently.

• It pushes the functions of the on-chip interconnect to center stage (especially in its support of distributed DMAs and thread synchronization) and has forced IBM to think hard about on-chip interconnect requirements that will need to be addressed in HPC's probable many-core future. PowerXCell 8i's division-of-labor design also maps naturally to the variety of parallel abstraction and programming models being developed within IBM and by external partners. This has already stimulated the development of multiple programming model alternatives for PowerXCell 8i and should ease the burden of porting PowerXCell 8i programs to other accelerated platforms.

• While it is more difficult to single-source compile to two distinct processors and instruction sets mediated by an interconnect, IBM's intention to produce such a compiler is visible in its working prototype and supported by successes already achieved in other contexts such as Partitioned Global Address Space (PGAS) compiler development in which locality and parallel extensions have been added to standard programming languages and subroutine libraries act as a mediation layer.

IBM and its PowerXCell 8i software development partners are working along these lines to lower the accelerated, parallel programming barrier that IDC has noted is the primary difficulty limiting the adoption of acceleration technology in HPC.
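The division of labor described in this section, in which the PPE spawns threads onto the SPEs and gathers their results, can be illustrated with a minimal Python sketch. Ordinary Python threads stand in for SPE threads here, the kernel is an arbitrary a*x + y example, and none of the names correspond to IBM's SDK or to the Linux interfaces it uses.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SPES = 8  # SPEs per PowerXCell 8i chip, from the text


def spe_kernel(a, xs, ys):
    """Work done by one 'SPE' thread: a*x + y over its slice of the data."""
    return [a * x + y for x, y in zip(xs, ys)]


def ppe_offload(a, xs, ys, num_spes=NUM_SPES):
    """The 'PPE' role: partition the data, spawn one task per slice, gather in order."""
    n = len(xs)
    step = max(1, -(-n // num_spes))  # ceiling division, at least one element per slice
    bounds = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=num_spes) as spes:
        futures = [spes.submit(spe_kernel, a, xs[lo:hi], ys[lo:hi]) for lo, hi in bounds]
        result = []
        for future in futures:  # collect slices in submission order
            result.extend(future.result())
    return result


if __name__ == "__main__":
    xs = [float(i) for i in range(20)]
    ys = [1.0] * 20
    print(ppe_offload(2.0, xs, ys))
```

The control code and the kernel are separate functions, which mirrors the separate compilation of PPE and SPE code that the text describes.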


PowerXCell 8i's Multitiered Programming Environment

The range of form factors and regimes for which the PowerXCell 8i processor is intended and IBM's goal of mainstreaming PowerXCell 8i acceleration in HPC present challenges to IBM's PowerXCell 8i software development team. Those programming PowerXCell 8i in government labs are sophisticated, performance-oriented, and determined to work close to the hardware to get maximum performance. On the other hand, users in the financial sector looking to accelerate the constantly changing computational kernels used to price derivatives are concerned most about ease of use, prototyping capability, and how the tools will play with their current development environments. Added to these different requirements is the fact that the PowerXCell 8i can function as a standalone heterogeneous CMP or as a bus-resident accelerator.

To serve these extremes and points in between, IBM has invested substantially over the past five years in a multitiered programming environment that includes a native programming layer (tier 1), a library and programming framework assisted layer (tier 2), and a number of full-featured application development environments (tier 3). A significant part of the progress in building up the PowerXCell 8i programming environment has come from the close collaboration between IBM and the programming staff and subcontractors at Los Alamos National Laboratory. Table 5 presents some of the advantages, disadvantages, best-fit uses and application areas, and software components from IBM and its partners in each tier.

T A B L E 5

I B M ' s M u l t i t i e r e d P o w e r X C e l l 8 i P r o g r a m m i n g E n v i r o n m e n t

| Programming Tier Features | Native or Direct Programming | Framework-Assisted Programming | Development Environment Programming |
| --- | --- | --- | --- |
| Advantages | Best possible performance; direct control of on-chip resources | Reduced development time; easy optimized library link-ins | Minimal development time; quick prototyping; parallel platform independence |
| Disadvantages | More expertise and coding required | Performance may be lower; framework, library limits | Performance may be lower; application area specificity |
| Area of application | Stable applications running on large HPC, real-time, or embedded systems; whole application or bus-based acceleration | First-time ports of most applications; bus-based acceleration | Applications to be run on multiple platforms; limited skill base for development |
| Components/examples | SDK 3.0: IBM XL C, C++, Fortran; SPE API; local store management routines; assembly visualizer, etc. | BLAS, LAPACK, FFT; Accelerated Library Framework (ALF); Data Communication and Synchronization Library (DaCS); Dynamic Application Virtualization (DAV); Micro-MPI, etc. | RapidMind Multicore Development Platform; Gedae; VSIPL; Mentor Graphics EDGE IDE; IBM's single-source OpenMP PowerXCell 8i compiler, etc. |

Source: IDC, 2008


PowerXCell 8i's Native Programming Layer

At the base of PowerXCell 8i's programming environment is IBM's SDK 3.0, a Linux-based, native programming tool chain of compilers, debuggers, profilers, and libraries now in its third release. This includes products from IBM (its XL compilers), from third parties (TotalView), and from open source Linux software developers (GNU tools). In economizing on-chip hardware to make the PowerXCell 8i more watt and transistor efficient, IBM knew that it had to build more intelligence into its compilers and its runtime and support libraries to achieve performance. For the native programmer seeking optimal performance, this intelligence has been first provided as APIs to programmer-callable libraries that control low-level PowerXCell 8i features involved in parallel programming, memory management, and on-chip communication to be used with IBM's C, C++, and Fortran compilers. In parallel, this API-related and general intelligence (e.g., scalar data alignment, SIMD parallelism extraction, and branch optimization) has been added to the compiler itself. Currently, objects for PPEs and SPEs are produced with separate compiles and are then linked to generate a single binary, but IBM's long-term goal is to provide single-source, optimized compilation for the dual-ISA, PowerXCell 8i processor. A portable, prototype, single-source compiler that expresses the PowerXCell 8i's heterogeneous, thread-based parallelism using OpenMP pragmas is under development.

Providing single-source compilation addresses what the market expects from the x86 compilers once acceleration hardware is added to the x86 core by Intel and AMD. It also responds directly to a primary pain point expressed by likely HPC accelerator users and buyers in IDC surveys. Still, by providing user-callable libraries for the most important performance-related functions, IBM recognizes the fact that compilers will never understand as much about a program as the author. This dual-track approach to native programming on the PowerXCell 8i is illustrated by alternative approaches to the management of the 256 KB local store on each SPE. A native programmer may choose to explicitly make get and put calls to the local store to minimize memory-latency-related stalls or to rely on the convenience of the SPE compiler's built-in software cache capability. In the first case, the best possible performance is the goal, and in the second, reduced development time is the goal. The layered nature of the PowerXCell 8i programming environment is intended to let users start at one level and then move to the level of programming performance and convenience most suitable to their situation.
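The two approaches to local store management can be contrasted with a toy model. The sketch below is not IBM's software cache or its get/put API: it is a hypothetical Python illustration in which an explicit version moves large blocks that the programmer sizes by hand, while a simple direct-mapped software cache fetches 128-byte lines on demand behind an ordinary-looking load. The line size, cache size, and block size are assumptions chosen for the example.

```python
LINE_BYTES = 128                             # assumed line size, echoing the 128-byte blocked loads
DOUBLE_BYTES = 8
ELEMS_PER_LINE = LINE_BYTES // DOUBLE_BYTES  # 16 doubles per line


class SoftwareCache:
    """Toy direct-mapped cache of fixed-size lines held in a small local buffer."""

    def __init__(self, main_memory, num_lines=64):
        self.mem = main_memory
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.lines = [None] * num_lines
        self.transfers = 0                   # number of line fetches from main memory

    def load(self, index):
        line_no = index // ELEMS_PER_LINE
        slot = line_no % self.num_lines
        if self.tags[slot] != line_no:       # miss: fetch the whole line
            start = line_no * ELEMS_PER_LINE
            self.lines[slot] = self.mem[start:start + ELEMS_PER_LINE]
            self.tags[slot] = line_no
            self.transfers += 1
        return self.lines[slot][index % ELEMS_PER_LINE]


def explicit_sum(main_memory, block_elems=2048):
    """Programmer-managed transfers: one explicit 'get' per large block."""
    total, transfers = 0.0, 0
    for start in range(0, len(main_memory), block_elems):
        block = main_memory[start:start + block_elems]  # stand-in for an explicit get
        transfers += 1
        total += sum(block)
    return total, transfers


def cached_sum(main_memory):
    """Convenience path: ordinary element loads, with transfers hidden in the cache."""
    cache = SoftwareCache(main_memory)
    total = 0.0
    for i in range(len(main_memory)):
        total += cache.load(i)
    return total, cache.transfers


if __name__ == "__main__":
    data = [float(i) for i in range(4096)]
    print("explicit:", explicit_sum(data))
    print("cached:  ", cached_sum(data))
```

Both versions return the same sum. The explicit version issues fewer, larger transfers at the cost of hand-written block management, which is the trade-off between performance and development time that the text describes.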

PowerXCell 8i's Framework/Library-Assisted Programming Layer

Over the past five years, IBM and its partners have invested in the PowerXCell 8i's programming stack to produce frameworks and libraries that can be called from a variety of applications that remain otherwise largely unmodified. IBM expects most initial PowerXCell 8i porting and programming efforts to start in this layer. Library-layer resources are particularly useful in situations in which the PowerXCell 8i processor is to be used for bus-based acceleration. For traditional HPC applications (e.g., CAE, EDA, FSS), IBM has created a suite of numerical libraries parallelized and optimized for the PowerXCell 8i, including BLAS, LAPACK, FFTs, and a Monte Carlo library among others.


Beyond traditional HPC numerical libraries, IBM provides frameworks supporting different parallel programming acceleration models. These include the task-oriented Accelerated Library Framework (ALF), which allows for the definition and distribution of thread-based quantities of work; the Data Communication and Synchronization Library (DaCS), which supports convenient data movement between PowerXCell 8i cores; and the Dynamic Application Virtualization (DAV) library, which is designed to allow C/C++, Java, and VBA-Excel programmers to accelerate application functions from Windows environments. Partners such as Mercury (Mercury Multicore Framework) and Platform (Symphony) have provided tools for this layer as well. Standard HPC parallel programming paradigms such as OpenMP and OpenMPI are also supported.

PowerXCell 8i's Full-Featured Development Environments

The top tier in the PowerXCell 8i programming environment is important in part because of the variety of markets the processor is intended to serve beyond traditional HPC. These include signal processing, graphics programming, video surveillance, and real-time operating system environments, among others. Full-featured integrated development environments (IDEs) such as VSIPL++ and Gedae (originally signal processing), EDGE IDE (graphics programming), and the Workbench Development Suite (real-time OS programming) are a part of the application development culture in these non-HPC submarkets. All of these IDEs run on the PowerXCell 8i, but IBM also supports partners producing PowerXCell 8i IDEs for more traditional HPC submarkets to help manage the increase in HPC programming complexity stimulated by clusters, multicore parallelism, and now acceleration technology.

IDC sees the growth in interest in accelerated processing as the beginning of a trend to include more specialized processing units in HPC, which will add still more complexity to the HPC programming problem. These effects and the growing use of HPC by users less informed about parallel programming have created a greater interest in full-featured and specialized HPC application development environments. IBM is working with a number of partners in this space:

• RapidMind, which offers a parallel programming abstraction model and programming environment that can be used on a variety of parallel acceleration devices including PowerXCell 8i

• Platform Computing, whose Symphony application supports the design and distribution of applications (particularly in the financial services market) on cluster and grid architectures

• Simudyne, whose GeoLib software and development environment has been combined with the PowerXCell 8i multicore architecture to accelerate the oil and gas processing and delivery cycle


A valued and typical feature of IDEs is the ability to quickly prototype and adapt programs to changing requirements. This is particularly important in the financial services markets where the applications underlying the pricing and risk profiling of derivatives must frequently be updated to respond to the competition and changing market conditions.

Finally, the push by IBM and others to provide a single-source compiler for the PowerXCell 8i should be mentioned in the context of full-featured environments. From the point of view of traditional HPC programmers, the option to use already understood and standard parallel programming models such as OpenMP and OpenMPI on heterogeneous core clusters with a single-source compiler meets their current full-featured programming environment expectations. Single-source compiler technology for acceleration is not new (Cray's vector compiler is one example), and when accelerators become part of the x86 core in the next 12 to 18 months at Intel and AMD, accelerated single-source x86 compilers will also be made available.

IDC's overall view of IBM's supporting software investment in PowerXCell 8i heterogeneous, multicore accelerated silicon is positive. It is substantial, it leverages multiple partnerships, and it is tiered to serve the variety of markets in which PowerXCell 8i is expected to play. IBM's software challenge will be to sustain its investment, educate and train its customers in its use and anatomy, and prune those branches of its programming development tree that are not driving the adoption of IBM's PowerXCell 8i technology.

A Look at IBM's PowerXCell 8i Programming Partners

RapidMind Supports Parallel Programming Abstraction on PowerXCell

Discussions in October 2005 following a conference attended by both IBM and RapidMind started what is now a three-year working relationship between the two companies. RapidMind has designed a data-parallel programming platform that can be used to support development for arguably any chip- or board-level parallel processing engine including GPUs, multicore processors, PowerXCell 8i processors, and so on. In this sense, the HPC programmer can take a working serial or distributed parallel application, add RapidMind's board-level data-parallel constructs, and proceed to compile the application for node-local parallel processing on any of the processor types mentioned earlier. It is a write-once, compile-and-run-many technology based on the idea that all parallel programmable hardware can be driven from RapidMind's common parallel programming abstraction.

RapidMind-adapted C++ code is compiled directly into IBM PowerXCell 8i assembly. RapidMind's parallel extensions offer reduced development time, when compared with programming the PowerXCell 8i natively, and a substantial fraction of the performance of natively programmed code. The initial version of RapidMind's PowerXCell back end leaned heavily on its prior work with GPUs, but over the past year, intelligence specific to the PowerXCell 8i processor has been added to the compiler. This includes taking advantage of capabilities that PowerXCell 8i hardware supports more naturally than GPU hardware, including DMA-directed block loads from shared memory, gather-scatter operations, and full IEEE double-precision, among other things.

The collaboration has produced a generally effective compiler that delivers a substantial percentage of the performance of native PowerXCell 8i code. The partnership has now turned its attention to improving the performance of key codes and benchmarks in IBM's target HPC submarkets, including seismic processing, financial services, aerospace, digital media, and medical imaging. Many of IBM's public demonstrations for these industries have been made possible by the RapidMind partnership, which has allowed submarket application kernels to be ported to the PowerXCell 8i quickly. RapidMind is an IBM PartnerWorld partner and as such receives early hardware and software releases from IBM.

PowerXCell Proves to Be a Natural Platform for Gedae's Data-Flow Programming Environment

While Gedae, a spin-off of Lockheed Martin, has been around for almost two decades, its partnership with IBM to extend its tool's capabilities to the PowerXCell 8i is not quite two years old. Even so, the interaction has progressed rapidly to the point where today, IBM's PowerXCell 8i products (e.g., QS22 blade and PXCAB card) may all be used with Gedae's graphical, data-flow parallel programming environment. Gedae is a name that is not yet familiar to those in the HPC user community, but it has a strong presence in the aerospace signal processing and embedded systems markets where it is frequently used to create radar applications. As might be expected, it has been used as a tool to program custom, embedded FPGA-based systems in the past, but has now been augmented to take advantage of the PowerXCell 8i's data-flow computing capability.

Similar to many circuit-level programming tools used with FPGAs, Gedae uses a GUI based on optimized standard library function blocks to build its applications, which ultimately generate C code that is compiled with IBM's SDK. One of Gedae's strengths is that it supports a very large number of standard mathematical libraries as programmable function blocks and can use libraries provided by the user or other vendors. Programmers graphically pipe together their application's kernels, basic blocks, and functions using these libraries. Gedae can then take the entire program into view and design a pipelined data load, processing, and storage protocol to make best use of application data locality. Radar application development with this optimized library method has demonstrated good performance relative to the performance of natively programmed applications.
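The data-flow idea described above can be illustrated with a minimal sketch. It is not Gedae's tool or API: the function names (`window`, `scale`, `magnitude_sum`, `connect`, `run_pipeline`) are hypothetical, and plain Python generators stand in for graphically connected function blocks. The sketch shows only the general pattern of cutting a sample stream into tiles and piping those tiles through a chain of reusable blocks.

```python
from typing import Callable, Iterable, Iterator, List

# A "block" consumes a stream of tiles and produces a stream of tiles.
Tile = List[float]
Block = Callable[[Iterator[Tile]], Iterator[Tile]]


def window(samples: Iterable[float], size: int) -> Iterator[Tile]:
    """Source block (hypothetical): cut a sample stream into fixed-size tiles."""
    tile: Tile = []
    for sample in samples:
        tile.append(sample)
        if len(tile) == size:
            yield tile
            tile = []
    if tile:  # emit a final, shorter tile if samples remain
        yield tile


def scale(gain: float) -> Block:
    """Function block (hypothetical): multiply every sample by a gain."""
    def block(tiles: Iterator[Tile]) -> Iterator[Tile]:
        for tile in tiles:
            yield [gain * x for x in tile]
    return block


def magnitude_sum() -> Block:
    """Function block (hypothetical): reduce each tile to the sum of |x|."""
    def block(tiles: Iterator[Tile]) -> Iterator[Tile]:
        for tile in tiles:
            yield [sum(abs(x) for x in tile)]
    return block


def connect(source: Iterator[Tile], *blocks: Block) -> Iterator[Tile]:
    """Pipe the source through each block in order, like wiring a graph."""
    stream = source
    for block in blocks:
        stream = block(stream)
    return stream


def run_pipeline(samples: Iterable[float], tile_size: int, gain: float) -> List[float]:
    """Build the graph and pull every tile through it."""
    graph = connect(window(samples, tile_size), scale(gain), magnitude_sum())
    return [tile[0] for tile in graph]
```

Because each block sees only one tile at a time, a tool with a view of the whole graph is free to decide how tiles are loaded, processed, and stored, which is the locality advantage the paragraph above attributes to this programming style.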

While Gedae has extended the reach of IBM's PowerXCell 8i processor into the signal processing world, IBM is working with Gedae to introduce its data-flow parallel programming technology to more traditional HPC markets such as financial services and seismic processing, as well as to get Gedae to think more broadly about cluster-based distributed parallel processing. This involves taking an inventory of what markets Gedae's current libraries map onto and what additional routines need to be provided. Gedae is also an IBM PartnerWorld partner and continues to receive early product releases. Although GUI-based data-flow programming models such as Gedae's have not received as much attention from traditional HPC buyers (except those working with FPGAs), the ongoing IBM-Gedae partnership provides a channel between the signal processing and HPC markets that offers the prospect of reducing the parallel programming barrier for accelerators in a unique way while delivering high sustained performance.

FUTURE OUTLOOK

In delivering its new PowerXCell 8i heterogeneous multicore processor and product lines to market earlier this year, IBM has fulfilled its contractual obligations to LANL and met HPC user community expectations by recasting the original game-oriented, Cell/B.E. microarchitecture to more completely meet HPC application requirements. In the process, it has substantially compounded the momentum that has been building behind HPC accelerator technology by offering a mix of PowerXCell 8i supporting form factors at several price points, a variety of programming model alternatives, and high sustained IEEE, double-precision floating-point performance per watt. The PowerXCell 8i product line is now a bona fide, HPC-oriented branch off the original Cell/B.E. trunk running parallel to its game-oriented parent, but with its own road map and 45nm follow-on. IDC believes that in combination, IBM's PowerXCell 8i product announcements bode well both for IBM in the HPC accelerator space and for HPC acceleration technology in general. The future for accelerators includes not only continued challenges for IBM but also new opportunities for IBM made greater by executing on its PowerXCell development plans.

CHALLENGES AND OPPORTUNITIES

Challenges

! IBM's investment in the PowerXCell 8i programming environment, both internally and through its partnerships, makes clear that the company understands the importance of lowering the programming barriers for acceleration technology, but there is more to the programming picture than this. The current period (characterized by HPC programming model experimentation and diversity) will eventually pass as the demands of the HPC user community and ISVs prune the tree of alternatives and settle on a small number of ways to express both node-distributed parallelism (MPI will continue to dominate here) and node-local parallel acceleration. Whatever these standard practices turn out to be (whether from IBM or elsewhere), IBM must ensure that they map easily onto PowerXCell 8i's underlying and evolving heterogeneous, multicore microarchitecture.

! Given the knowledge that both Intel and AMD are working on on-chip acceleration technologies that will be integrated into the x86 instruction set architecture, IBM will have to match and probably exceed the ease-of-programming standards eventually set by these x86 ISA competitors. This almost certainly includes optimized single-source compilation for on-chip acceleration with or without some type of source-level language support (e.g., OpenMP and PGAS). The challenge to IBM's compiler group will be to deliver high sustained performance on key applications by tightly coupling essential chip-level functionality to compiler intelligence even as the number and nature of the PowerXCell 8i's SPEs, on-chip interconnect, and MFC evolve.

! As a heterogeneous, chip-level multicore processor, PowerXCell 8i smartly adds HPC-oriented heterogeneity to the chip, but at the same time begs the question of what the correct next design step should be: more identical SPEs, more workload-specific heterocores, a more powerful general-purpose core, and/or a more directly programmable interconnect? Making the right design choices in the context of multiple constraints is clearly a challenge. IBM's next-generation PowerXCell 8i microprocessor must continue to make sustained performance per watt a key design and marketing metric; it must preserve not only its distinction from the competition but also its volume economic base; it must respond to evolving HPC market (even submarket) requirements; and it must take full advantage of IBM's microarchitectural engineering prowess.

! The HPC market has been captured by standards-based, volume-priced component technologies in recent years, and this fact is expressed in the growth in market share of HPC clusters. Volume pricing will continue to play an important role in HPC. The future success of IBM's HPC-oriented PowerXCell 8i-based products depends on the volume demand for its game-oriented sibling and the underlying union of their manufacturing base. If future sales of the Sony PlayStation are poor, the prospects of IBM's PowerXCell 8i product line will be diminished. If PowerXCell 8i's manufacturing process requires too much that is unique to it alone, again its prospects will be diminished. The overarching constraint imposed by HPC's current volume-pricing regime means that any product price premiums must be clearly scaled and connected to added value and that IBM must take every opportunity to cross-pollinate its PowerXCell 8i development successes into its gaming markets and vice versa when either serves the other's market need.

! Delivering memory bandwidth to multicore processors is the challenge in HPC processor design today. IBM emphasizes the probable sustained performance advantage its single-chip, MFC PowerXCell 8i microarchitecture will have over competition from the bus-divided GPU-based acceleration products. Proving this in real-world applications with PowerXCell 8i's less hardware-, more software-intensive design strategy and following up in its next-generation products will be very important to PowerXCell 8i's success. This is a problem of both getting the required data into the processor, which is the job of the MFC, and keeping it there as long as it can be used. The flexibility and programmability of the on-chip interconnect will play a central role in the second case. As the number and type of cores on-chip grow, the ability to flexibly interconnect them in service to a specific constellation of computational kernels (whether arranged in a heteropipeline, in a SIMD array, or some other way) will be important in keeping sustained performance high and watts consumed per MFLOP low.
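The "getting the required data into the processor" half of the memory bandwidth challenge is commonly addressed in software by double buffering: the transfer of the next block is requested before work begins on the current one. The sketch below illustrates only that schedule and makes several assumptions. It is not IBM SDK or MFC code, `fetch_block` and `blocked_sum_of_squares` are hypothetical names, and a single Python worker thread stands in for an asynchronous DMA engine, so it shows the ordering of operations rather than any real performance overlap.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def fetch_block(memory: Sequence[float], start: int, size: int) -> List[float]:
    """Stand-in (hypothetical) for a block transfer into a local buffer."""
    return list(memory[start:start + size])


def blocked_sum_of_squares(memory: Sequence[float], block_size: int) -> float:
    """Sum x*x over memory, prefetching block i+1 while computing on block i."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    starts = list(range(0, len(memory), block_size))
    if not starts:
        return 0.0

    total = 0.0
    # One worker thread plays the role of the asynchronous transfer engine.
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch_block, memory, starts[0], block_size)
        for i in range(len(starts)):
            current = pending.result()  # wait for the block requested earlier
            if i + 1 < len(starts):
                # Request the next block before computing on the current one.
                pending = dma.submit(fetch_block, memory, starts[i + 1], block_size)
            total += sum(x * x for x in current)  # compute on the local buffer
    return total
```

For example, `blocked_sum_of_squares([1, 2, 3, 4, 5], 2)` processes three blocks and returns the same result as an unblocked sum of squares. The design point is that the compute step never requests data it needs immediately; every block it touches was asked for one step earlier.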

Opportunities

! IBM's success with its QS22/LS21 hybrid (LANL's Roadrunner) and its PowerXCell 8i product line announcements such as its QS22 blade further validate the notion that intranode and on-chip parallel acceleration technologies are a viable path to increase both HPC performance and power efficiency. In addition to Cray Inc., IBM is among the few vendors that deliver a fully integrated, data-parallel accelerated HPC system. The QS22 blade (and its custom cousin, the Triblade) offers the HPC market much of what surveys indicate it has been looking for in future acceleration products. They are blade dense, power efficient, and based on a volume economic model, and they offer the prospect of better sustained performance through their DLP- and MFC-supported, blocked memory operations. IBM's range of PowerXCell 8i products presents the company with an opportunity to use its accelerator market advantage to generate new sales.

! While IBM has successfully sold x86-based HPC products and has benefited from the cluster revolution described at the beginning of this paper, IBM is a unique company with HPC products (and research and development teams) that span the entire HPC market space. Its ability to profit from and challenge the dominant themes in the industry creates an options-based hedge against future market uncertainties. IBM has the opportunity to raise the profile of its non-x86 PowerXCell 8i HPC alternatives, particularly in space-, power-, and cooling-limited contexts. The company's multiple form factors allow it to augment systems in place (like a GPU accelerator card) or upgrade entire systems within current datacenter power, cooling, and space envelopes. This option should be particularly appealing to departmental and divisional HPC system buyers that wish to improve performance without spending money on datacenter upgrades.

! As IDC's accelerator market survey data illustrate, HPC accelerator programming barriers are a key factor limiting accelerator hardware adoption, and minimizing them is essential. IBM's investment in a variety of PowerXCell 8i parallel programming models and its view that this variety combines to define one unifying parallel programming abstraction put it in a position to both influence accelerator parallel programming models and tune its products to them as marketwide working models are chosen. IBM's PowerXCell 8i-based acceleration products appear to have sufficient support from IBM's software investments to attract buyers.

CONCLUSION

IBM's recent PowerXCell 8i-based product and performance announcements, including those for LANL's Roadrunner, the QS22 blade, and the PXCAB card, have pushed acceleration technology to the forefront of HPC news in 2008. By breaking the petaflop barrier to capture the top spot in the June 2008 TOP500 HPC system rankings and demonstrating very high performance (~300 TFLOPS) on several LANL codes, IBM has helped to remove lingering doubts that accelerators (data-parallel accelerators in particular) will be an important part of the future of HPC. As noted earlier, in Roadrunner, IBM combined standards-based components, industry-leading MFLOPS per watt, and acceleration technology to make its petaflop performance achievement possible. While Roadrunner establishes a presence for IBM's new PowerXCell 8i offerings at the high end of the market, IBM's QS22 blades and PXCAB cards extend the reach of the PowerXCell 8i processor to the broader HPC community with advantages that address some of the key accelerator adoption barriers laid out earlier, including:

! Higher sustained performance on key applications through a shared memory architecture supporting vector-like, blocked, and buffered DMA memory operations

! Processors that are fully IEEE, double-precision floating-point capable

! Very good double-precision performance per watt made possible by a hardware design concept that emphasizes HPC performance specificity and simplicity and that transfers some responsibility for performance to software

! A mix of dense form factors to relieve pressure on rack and floor space

! Larger memory configurations that approach those offered by standard, rackmounted, cluster motherboards

! Support for a mix of programming models from IBM and its partners in government, academia, and industry

Many of PowerXCell 8i's product features have been exercised by industry-leading scientists and programmers at LANL over the past two years to give the PowerXCell 8i hardware and software products a substantial in-the-field trial. PowerXCell 8i's market success represents a vote for building HPC systems with mixed-core CMP microarchitectures over the same-core alternatives.

The marketplace for accelerators remains complex and is far from reaching equilibrium. Still, this year's PowerXCell 8i product announcements by IBM are another significant event driving accelerators into the HPC mainstream and, when combined with the other HPC accelerator-related advances and announcements over the past 12 months, will perhaps mark 2008 as the year data-parallel accelerated HPC systems began to recapture part of the HPC market they once dominated.

Copyright Notice

External Publication of IDC Information and Data: Any IDC information that is to be used in advertising, press releases, or promotional materials requires prior written approval from the appropriate IDC Vice President or Country Manager. A draft of the proposed document should accompany any such request. IDC reserves the right to deny approval of external usage for any reason.

Copyright 2008 IDC. Reproduction without written permission is completely forbidden.
