Introduction to the Cell Multiprocessor
J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy
IBM Systems and Technology Group
IBM Journal of Research and Development
Vol. 49, No. 4/5, Pg. 589 (Jul-Sep 2005)
Presented by John Ingalls
ECE 259 - April 8, 2010
Design Summary: PPE
ISA: 64-bit IBM Power Architecture with SIMD.
1 PPE, 8 SPEs, 1 memory controller, and 1 I/O controller, all on a coherent bus (single address space).
PowerPE: 2-issue, in-order, 2-thread SMT; 32 KB L1 I$/D$; 512 KB L2$ with software management hooks; 128-bit total SIMD width; Vector/SIMD issue queue separate from scalar execution.
Design Summary: SPE
SynergisticPE: in-order SIMD, 128-bit total width, like the PPE.
Local Store (LS): 256 KB, single-ported; the port serves either a 128-bit SIMD-word access or a 128-byte instruction fetch or DMA transfer.
128-entry register file to support static (compiler) instruction reordering.
Area efficient: 15% control; the rest is execution units and Local Store.
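One way to picture what a 256 KB software-managed Local Store means for a program is the sketch below. It is only an illustration under stated assumptions, not Cell SDK code: the 64 KB code footprint, the two-buffer split, the 128-byte rounding of transfers, and the names `tile_elements` and `run_tiled` are all hypothetical, and plain list slicing stands in for DMA.

```python
LOCAL_STORE_BYTES = 256 * 1024  # LS size from the slide
DMA_LINE_BYTES = 128            # wide LS access used for fetch/DMA (from the slide)


def tile_elements(code_bytes, elem_bytes, buffers=2):
    """Elements per buffer when `buffers` equal buffers share the LS with code.

    Assumption: code and data compete for the same 256 KB, and each buffer
    is rounded down to a multiple of 128 bytes.
    """
    free = LOCAL_STORE_BYTES - code_bytes
    if free <= 0:
        raise ValueError("code does not fit in the local store")
    per_buffer = free // buffers
    per_buffer -= per_buffer % DMA_LINE_BYTES
    return per_buffer // elem_bytes


def run_tiled(main_memory, kernel, code_bytes=64 * 1024, elem_bytes=4):
    """Process `main_memory` one LS-sized tile at a time.

    The slice copy stands in for a DMA transfer into the LS; extending
    `out` stands in for the DMA transfer back to main memory.
    """
    n = tile_elements(code_bytes, elem_bytes)
    out = []
    for start in range(0, len(main_memory), n):
        local = main_memory[start:start + n]   # "DMA in" (explicit copy)
        local = [kernel(x) for x in local]     # compute on LS-resident data only
        out.extend(local)                      # "DMA out"
    return out


doubled = run_tiled(list(range(100_000)), lambda x: 2 * x)
```

The sketch runs its copies and compute steps one after another. The two-buffer split is sized so that, on real hardware, a transfer into one buffer could overlap with computation on the other, but that overlap is not modeled here.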
Other Features
I/O supports direct connection to another Cell to easily build a cache-coherent multiprocessor.
Native binary compatibility with Power-ISA apps.
Modular design, but still fully custom. Extensive test and monitoring circuitry.
Programming Challenges:
SPE Local Store is software managed.
Each SPE supports one thread context, and context switches are expensive.
Models:
  Function Offload: function call from the PPE
  Device Extension: SPE isolated, like a device
  Compute Acceleration: PPE aggregates SPE results
  Streaming: each SPE is a step in a software pipeline
  Shared Memory Multiprocessor: conventional
  Asymmetric Thread Runtime: p-threads
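The Streaming model listed above can be sketched as a chain of stages, each consuming the previous stage's output. This is a minimal illustration, not the paper's runtime: the names `stage` and `build_pipeline` are hypothetical, each generator merely stands in for one SPE, and the three arithmetic steps are placeholders for real kernels.

```python
def stage(fn, upstream):
    """One pipeline step: apply `fn` to every item arriving from upstream."""
    for item in upstream:
        yield fn(item)


def build_pipeline(source, steps):
    """Chain the steps so that each one feeds the next, in order."""
    stream = iter(source)
    for fn in steps:
        stream = stage(fn, stream)
    return stream


# Three placeholder kernels, one per hypothetical SPE.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
result = list(build_pipeline(range(5), steps))
print(result)  # [-1, 1, 3, 5, 7]
```

Python generators run on a single thread, so the sketch shows only the dataflow. On Cell the stages would run concurrently on separate SPEs, with data handed from one Local Store to the next.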
Good
Paper is easy to follow and doesn’t throw too much complicated material at the reader.
Built and shipped on time by a joint venture of IBM, Sony, and Toshiba.
Many applications in media and supercomputing.
Bad
The authors keep listing static limitations imposed by their models, such as explicitly managed caches, as advantages.
No hard performance data or comparison to the competition. Only “anecdotal evidence” shows that it is possible to fully utilize Cell.
Conclusion / Questions
Keywords:
  Heterogeneous multi-core SIMD processor.
  Single address space across all cores on chip.
  1x conventional PPE for control.
  8x SPEs for streaming SIMD are very fast and power efficient if used.
  Several programming models are feasible.
Questions:
  How could the programming models be easier?
  What direction should this architecture grow in?