Introduction to the Cell Multiprocessor

J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy

IBM Systems and Technology Group

IBM Journal of Research and Development

Vol. 49, No. 4/5, Pg. 589 (Jul-Sep 2005)

Presented by John Ingalls

ECE 259 - April 8, 2010

Design Summary: PPE

ISA: 64-bit IBM Power Architecture with SIMD.

1 PPE, 8 SPEs, 1 memory controller, and 1 I/O controller, all on a coherent bus (single address space).

PowerPE: 2-issue in-order 2-thread-SMT, 32KB L1 I$/D$, 512KB L2$ with software management hooks, 128-bit total SIMD width, separate Vector/SIMD issue queue from scalar execute.

Design Summary: SPE

SynergisticPE: in-order SIMD. 128-bit total width, like PPE.

Local Store (LS): 256KB, single-ported: one port serves 128-bit SIMD-word accesses as well as 128-byte insn fetches and DMA I/O.

128-entry regfile for static (compiler) insn reordering

Area efficient: 15% control, rest is Execute & Local Store.
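
To make the SPE numbers above concrete, here is a small arithmetic sketch. The constants come from the slide; the helper names are hypothetical, and the assumption that each of the 128 registers holds one full 128-bit SIMD word is not stated on the slide.

```python
# Figures taken from the SPE design summary slide.
SIMD_WIDTH_BITS = 128            # total SIMD width
LOCAL_STORE_BYTES = 256 * 1024   # 256KB Local Store
FETCH_BYTES = 128                # size of one instruction fetch from the LS
REGISTERS = 128                  # regfile entries


def lanes(element_bits: int) -> int:
    """Number of elements of the given width packed into one SIMD word."""
    if element_bits <= 0 or SIMD_WIDTH_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the SIMD width")
    return SIMD_WIDTH_BITS // element_bits


def simd_word_bytes() -> int:
    """Bytes in one 128-bit SIMD word."""
    return SIMD_WIDTH_BITS // 8


def local_store_simd_words() -> int:
    """How many SIMD words fit in the Local Store."""
    return LOCAL_STORE_BYTES // simd_word_bytes()


def local_store_fetch_lines() -> int:
    """How many 128-byte fetch lines fit in the Local Store."""
    return LOCAL_STORE_BYTES // FETCH_BYTES


def regfile_bytes() -> int:
    """Regfile capacity, ASSUMING every register is one full SIMD word wide."""
    return REGISTERS * simd_word_bytes()


if __name__ == "__main__":
    for bits in (8, 16, 32, 64):
        print(f"{bits}-bit elements: {lanes(bits)} lanes per SIMD word")
    print("SIMD words in LS:", local_store_simd_words())
    print("128-byte lines in LS:", local_store_fetch_lines())
    print("regfile bytes:", regfile_bytes())
```

Under that assumption the regfile holds 2KB of operands, which is what gives the compiler room to reorder and unroll statically.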

Other Features

I/O supports direct connection to another Cell to easily build a cache-coherent multiprocessor.

Native binary compatibility with Power-ISA apps.

Modular design, but still fully custom. Extensive test and monitoring circuitry.

Programming

Challenges:

SPE Local Store is software managed.

Each SPE supports one thread context, and context switches are expensive.

Models:

Function Offload: function call from PPE

Device Extension: SPE isolated, like a device

Compute Acceleration: PPE aggregates SPE results

Streaming: each SPE is a step in software pipeline

Shared Memory Multiprocessor: conventional

Asymmetric Thread Runtime: p-threads
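
As a rough illustration of the Streaming model combined with a software-managed Local Store, the sketch below simulates the idea in plain Python. It is not Cell SDK code: `dma_chunks`, `spe_stage`, `run_streaming`, the two kernels, and the 16KB chunk size are all hypothetical names and values chosen for illustration, and the simulation runs sequentially rather than on real SPEs.

```python
from typing import Callable, Iterable, Iterator, Sequence

LOCAL_STORE_BYTES = 256 * 1024   # from the slides: 256KB Local Store per SPE
CHUNK_BYTES = 16 * 1024          # assumption: staging size picked for illustration

Kernel = Callable[[bytes], bytes]


def dma_chunks(main_memory: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Stand-in for explicit DMA: software copies data in pieces that fit the LS."""
    if not 0 < chunk_bytes <= LOCAL_STORE_BYTES:
        raise ValueError("chunk must be non-empty and fit in the Local Store")
    for start in range(0, len(main_memory), chunk_bytes):
        yield main_memory[start:start + chunk_bytes]


def spe_stage(kernel: Kernel, chunks: Iterable[bytes]) -> Iterator[bytes]:
    """One simulated SPE: applies its kernel to each chunk and passes it on."""
    for chunk in chunks:
        yield kernel(chunk)


def run_streaming(main_memory: bytes, kernels: Sequence[Kernel],
                  chunk_bytes: int = CHUNK_BYTES) -> bytes:
    """Chain one simulated SPE per kernel, so each is a step in the pipeline."""
    stream: Iterable[bytes] = dma_chunks(main_memory, chunk_bytes)
    for kernel in kernels:
        stream = spe_stage(kernel, stream)
    return b"".join(stream)


# Two toy kernels (hypothetical) standing in for real SPE programs.
def add_one(chunk: bytes) -> bytes:
    return bytes((b + 1) & 0xFF for b in chunk)


def invert(chunk: bytes) -> bytes:
    return bytes(b ^ 0xFF for b in chunk)


if __name__ == "__main__":
    result = run_streaming(bytes([0, 1, 2, 255]), [add_one, invert], chunk_bytes=2)
    print(list(result))  # [254, 253, 252, 255]
```

The point of the sketch is that the program, not a hardware cache, decides what is resident in each stage's local memory and when it moves. Function Offload would look similar with a single stage called from the PPE side.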

Good

Paper is easy to follow and doesn’t throw too much complicated stuff at the reader.

Built and shipped on time by a joint venture of IBM, Sony, and Toshiba.

Many applications in media and supercomputing.

Bad

They keep listing static limitations imposed by their models as advantages, such as explicitly managed caches.

No hard performance data or comparison to competition. Only “anecdotal evidence” shows that it is possible to fully utilize Cell.

Conclusion / Questions

Keywords:

Heterogeneous multi-core SIMD processor.

Single address space across all cores on chip.

1x conventional PPE for control.

8x SPEs for streaming SIMD are very fast and power efficient if used.

Several programming models are feasible.

Questions:

How could the programming models be easier?

What direction should this architecture grow in?