
Sony/Toshiba/IBM (STI) CELL Processor

Scientific Computing for Engineers: Spring 2008


Nec Hercules Contra Plures (not even Hercules can prevail against many)

➢ A chip's performance grows roughly with the square root of its die area (Pollack's Rule)

➢ doubling one core's area buys only about 1.4x the performance, while spending the same area on two cores can approach 2x on parallel work

Cray's two oxen vs. Cray's 1024 chickens
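As a quick sketch of the trade-off, assuming only Pollack's Rule (single-core performance grows roughly as the square root of core area):

\[
P_{\mathrm{core}} \propto \sqrt{A}
\quad\Rightarrow\quad
\frac{P(2A)}{P(A)} = \sqrt{2} \approx 1.4,
\qquad\text{while two cores of area } A \text{ offer up to } 2\,P(A) \text{ on parallel work.}
\]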


Three Performance-Limiting Walls

➢ Power Wall

Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.

➢ Frequency Wall

Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.

➢ Memory Wall

On multi-gigahertz symmetric multiprocessors – even those with integrated memory controllers – latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.


... shallower pipelines with in-order execution have proven to be the most area and energy efficient. [...] we believe the efficient building blocks of future architectures are likely to be simple, modestly pipelined (5-9 stages) processors, floating point units, vector, and SIMD processing elements. Note that these constraints fly in the face of the conventional wisdom of simplifying parallel programming by using the largest processors available.

[...] David Patterson, [...] Kathy Yelick

The Landscape of Parallel Computing Research: A View from Berkeley

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

Conventional Processors Don't Cut It

future architectures

➢ shallow pipelines

➢ in-order execution

➢ SIMD units


When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available.

H. Peter Hofstee

Cell Broadband Engine Architecture from 20,000 feet

http://www-128.ibm.com/developerworks/power/library/pa-cbea.html

Cache Memories Don't Cut It

conventional processor

➢ bucket brigade

➢ a hundred people

➢ a few buckets


Their (multicore) low cost does not guarantee their effective use in HPC. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory. The stream benchmark performance of Intel's new Woodcrest dual core processor illustrates this point. [...] Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GBs/sec on nodes with two sockets and four cores. Yet, four-threaded runs of the memory intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem.

Richard B. Walsh

New Processor Options for HPC

http://www.hpcwire.com/hpc/906849.html

Cache Memories Don't Cut It

conventional processor

➢ latency-limited bandwidth
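One way to quantify "latency-limited bandwidth" is Little's Law: sustained bandwidth is bounded by the number of bytes a processor can keep in flight divided by the memory latency. The byte count and latency below are illustrative, not Woodcrest measurements:

\[
BW_{\mathrm{sustained}} \approx \frac{\text{bytes in flight}}{\text{latency}},
\qquad\text{e.g.}\quad
\frac{8 \times 64\ \mathrm{B}}{100\ \mathrm{ns}} \approx 5\ \mathrm{GB/s},
\]

far below the 21 GB/s peak quoted above, no matter how wide the bus, which is why a core that keeps only a handful of misses outstanding cannot use the bandwidth it is given.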


More flexible or even reconfigurable data coherency schemes will be needed to leverage the improved bandwidth and reduced latency. An example might be large, on-chip, caches that can flexibly adapt between private or shared configurations. In addition, real-time embedded applications prefer more direct control over the memory hierarchy, and so could benefit from on-chip storage configured as software-managed scratchpad memory.

[...] David Patterson, [...] Kathy Yelick

The Landscape of Parallel Computing Research: A View from Berkeley

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

Conventional Memories Don't Cut It

future architectures

➢ reconfigurable coherency

➢ software-managed scratchpad memory


CELL

➢ multi-core

➢ in-order execution

➢ shallow pipeline

➢ SIMD

➢ scratchpad memory

➢ 3.2 GHz

➢ 90 nm SOI

➢ 234 million transistors

➢ 165 million – Xbox 360

➢ 220 million – Itanium 2 (2002)

➢ 1,700 million – Dual-Core Itanium 2 (2006) – 12.8 GFLOPS (single or double)

➢ 204.8 GFLOPS single precision

➢ 204.8 GB/s internal bandwidth

➢ 25.6 GB/s memory bandwidth


CELL

➢ PPE – Power Processing Element

➢ SPE – Synergistic Processing Element

➢ SPU – Synergistic Processing Unit

➢ LS – Local Store

➢ MFC – Memory Flow Controller

➢ EIB – Element Interconnect Bus

➢ MIC – Memory Interface Controller


Power Processing Element

➢ Power Processing Element (PPE)

➢ PowerPC 970 architecture compliant

➢ 2-way Symmetric Multithreading (SMT)

➢ 32KB level 1 instruction cache

➢ 32KB level 1 data cache

➢ 512KB level 2 cache

➢ VMX (AltiVec) with 32 128-bit vector registers

➢ standard FPU

➢ fully pipelined DP with FMA

➢ 6.4 Gflop/s DP at 3.2 GHz

➢ AltiVec

➢ no DP

➢ 4-way fully pipelined SP with FMA

➢ 25.6 Gflop/s SP at 3.2 GHz
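To make the last two bullets concrete, here is a minimal sketch of a single VMX fused multiply-add on the PPE, assuming GCC's AltiVec intrinsics from <altivec.h> (compiled with -maltivec); the input values are arbitrary. Each vec_madd processes four single-precision lanes, i.e. 8 flops per instruction.

#include <altivec.h>
#include <stdio.h>

int main(void)
{
    vector float a = { 1.0f, 2.0f, 3.0f, 4.0f };
    vector float b = { 0.5f, 0.5f, 0.5f, 0.5f };
    vector float c = { 1.0f, 1.0f, 1.0f, 1.0f };

    /* d[i] = a[i] * b[i] + c[i] for i = 0..3, in one instruction */
    vector float d = vec_madd(a, b, c);

    float out[4] __attribute__((aligned(16)));
    vec_st(d, 0, out);                 /* store the vector register to memory */
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}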


Synergistic Processing Elements

➢ Synergistic Processing Elements (SPEs)

➢ 128-bit SIMD

➢ 128 vector registers

➢ 256KB instruction and data local memory

➢ Memory Flow Controller (MFC)

➢ 16-way SIMD (8-bit integer)

➢ 8-way SIMD (16-bit integer)

➢ 4-way SIMD (32-bit integer, single prec. FP)

➢ 2-way SIMD (64-bit double prec. FP)

➢ 25.6 Gflop/s SP at 3.2 GHz (fully pipelined)

➢ 1.8 Gflop/s DP at 3.2 GHz (7 cycle latency)
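For comparison with the PPE/VMX example above, the same 4-way single-precision FMA written for the SPU, assuming the SDK's <spu_intrinsics.h> and compilation with spu-gcc; values are arbitrary. One spu_madd per cycle is what yields the 25.6 Gflop/s SP figure (4 lanes x 2 flops x 3.2 GHz).

#include <spu_intrinsics.h>
#include <stdio.h>

int main(void)
{
    vector float a = { 1.0f, 2.0f, 3.0f, 4.0f };
    vector float x = { 10.0f, 10.0f, 10.0f, 10.0f };
    vector float y = { 0.5f, 0.5f, 0.5f, 0.5f };

    vector float r = spu_madd(a, x, y);    /* r[i] = a[i]*x[i] + y[i] */

    /* spu_extract reads one lane out of a vector register */
    printf("%f %f %f %f\n",
           spu_extract(r, 0), spu_extract(r, 1),
           spu_extract(r, 2), spu_extract(r, 3));
    return 0;
}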


SPE

➢ SIMD architecture

➢ two in-order (dual issue) pipelines

➢ large register file (128 128-bit registers)

➢ 256 KB of scratchpad memory (Local Store)

➢ Memory Flow Controller to DMA code and data from system memory
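A minimal sketch of the MFC usage described above: SPU-side code that DMAs a buffer from main storage into the Local Store and waits for completion, assuming the SDK's <spu_mfcio.h>. The effective address would in practice be handed to the SPE by the PPE (for example through the argp parameter of spe_context_run); the buffer size and tag id here are illustrative.

#include <spu_mfcio.h>

#define TAG 3                                    /* any tag id in 0..31 */

float buf[1024] __attribute__((aligned(128)));   /* destination in the Local Store */

void fetch(unsigned long long ea)                /* ea: address in main storage */
{
    mfc_get(buf, ea, sizeof(buf), TAG, 0, 0);    /* enqueue the DMA get */
    mfc_write_tag_mask(1 << TAG);                /* select this tag group */
    mfc_read_tag_status_all();                   /* block until the transfer completes */
}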


Element Interconnect Bus

➢ Element Interconnect Bus (EIB)

➢ 4 16B-wide unidirectional channels

➢ half the system clock (1.6 GHz)

➢ 204.8 GB/s bandwidth (arbitration)


EIB

➢ 16-byte channels

➢ 4 unidirectional rings

➢ token-based arbitration

➢ half system clock (1.6 GHz)


Main Memory System

➢ Memory Interface Controller (MIC)

➢ external dual XDR,

➢ 3.2 GHz max effective frequency,

(max 400 MHz, Octal Data Rate),

➢ each: 8 banks max 256 MB,

➢ total: 16 banks max 512 MB,

➢ 25.6 GB/s.


CELL Performance – Double Precision

In double precision

➢ every seven cycles each SPE can:

➢ process a two-element vector,

➢ perform two operations on each element.

➢ in one cycle the FPU on the PPE can:

➢ process one element,

➢ perform two operations on the element.

8 SPEs: 8 x 2 x 2 x 3.2 GHz / 7 = 14.63 Gflop/s

PPE FPU: 2 x 3.2 GHz = 6.4 Gflop/s

Total: 21.03 Gflop/s


CELL Performance – Single Precision

In single precision

➢ in one cycle each SPE can:

➢ process a four-element vector,

➢ perform two operations on each element.

➢ in one cycle the VMX on the PPE can:

➢ process a four-element vector,

➢ perform two operations on each element.

8 SPEs: 8 x 4 x 2 x 3.2 GHz = 204.8 Gflop/s

PPE VMX: 4 x 2 x 3.2 GHz = 25.6 Gflop/s

Total: 230.4 Gflop/s


CELL Performance – Bandwidth

Bandwidth at the 3.2 GHz clock:

➢ each SPU – 25.6 GB/s (compare to 25.6 Gflop/s per SPU),

➢ main memory – 25.6 GB/s,

➢ EIB – 204.8 GB/s (compare to 204.8 Gflop/s from 8 SPUs).


CELL Performance – Historical Perspective

Connection Machine CM-5 (512 CPUs)

➢ 512 x 128 Mflop/s = 65 Gflop/s DP

PlayStation 3 (4 units)

➢ 4 x 17 Gflop/s = 68 Gflop/s DP


Performance Comparison – Double Precision

1.6 GHz Dual-Core Itanium 2

➢ 1.6 x 4 x 2 = 12.8 Gflop/s

3.2 GHz CELL BE (SPEs only)

➢ 3.2 x 4/7 x 8 = 14.6 Gflop/s


Performance Comparison – Single Precision

1.6 GHz Dual-Core Itanium 2

➢ 1.6 x 4 x 2 = 12.8 Gflop/s

3.2 GHz SPE

➢ 3.2 x 8 = 25.6 Gflop/s

➢ One SPE = 2 Dual-Core Itanium 2 processors

3.2 GHz CELL BE (SPEs only)

➢ 3.2 x 8 x 8 = 204.8 Gflop/s

➢ One CBE = 16 Dual-Core Itanium 2 processors


SDK - Platforms

➢ Linux

➢ x86

➢ x86-64

➢ PPC64

➢ CELL BE

➢ RPM-based distribution

➢ Fedora Core recommended

➢ CELL plugins for Eclipse available


SDK - Compilers

➢ PPU and SPU are different ISAs – different sets of compilers

➢ SPU

➢ GCC

➢ G++

➢ XLC

➢ XLC++

➢ 32-bit

➢ PPU

➢ GCC

➢ G++

➢ GFORTRAN

➢ XLC

➢ XLC++

➢ XLF

➢ 32-bit

➢ 64-bit

➢ OpenMP

➢ assembler and linker are common to GNU and XL compilers

➢ XL compilers require GNU tool chain for cross-assembling and cross-linking for both the PPE and the SPE


SDK - Samples

➢ /opt/ibm/cell-sdk/prototype

➢ /lib

➢ FFT

➢ matrix

➢ game math

➢ audio resample

➢ curves and surfaces

➢ software managed cache

➢ .....

➢ /samples

➢ overlays

➢ using ALF

➢ simple DMA

➢ tutorial samples

➢ .....

➢ /workloads

➢ FFT

➢ MatMul

➢ .....


Compilation and Linking

➢ SPU objects are embedded in PPU objects (a C sketch follows below)

➢ SPU code and PPU code are linked into one executable

➢ SDK provides standard makefile structure

➢ /opt/ibm/cell-sdk/prototype

➢ make.env – compilation options

➢ make.footer – build rules – do not modify

➢ make.header – definitions – do not modify

➢ README_build_env.txt – makefile howto

➢ understand the build process

➢ run the default makefile

➢ see what it does
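As a sketch of the first bullet above (SPU objects embedded in PPU objects): once an SPU ELF has been embedded into a PPU object with the SDK's embedspu tool, the PPU code refers to it through an external spe_program_handle_t symbol instead of opening an image file at run time. The symbol name hello_spu below is an assumption; it is chosen when the SPU object is embedded.

#include <libspe2.h>

extern spe_program_handle_t hello_spu;    /* resolved at link time from the embedded SPU object */

int run_embedded(spe_context_ptr_t ctx)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_program_load(ctx, &hello_spu);     /* no spe_image_open() needed */
    return spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
}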



Hello World - libspe2

PPE:

#include <libspe2.h>
#include <pthread.h>

int main()
{
    spe_context_create()
    pthread_create()
    pthread_join()
    spe_context_destroy()
}

void *spe_thread()
{
    spe_image_open()
    spe_program_load()
    spe_context_run()
    pthread_exit()
}

SPE:

int main()
{
    // .....
}

➢ spe_context_run() is a blocking call

➢ create one POSIX thread for each SPE thread
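For reference, a fuller, compilable version of the skeleton above, assuming libspe2 and a separately built SPU executable named hello_spu (the name and file layout are illustrative, and error checking is minimal). The PPU side would typically be linked with -lspe2 -lpthread.

/* ppu_hello.c – PPE side */
#include <stdio.h>
#include <pthread.h>
#include <libspe2.h>

void *spe_thread(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    spe_program_handle_t *prog;
    unsigned int entry = SPE_DEFAULT_ENTRY;

    prog = spe_image_open("hello_spu");            /* open the SPU executable */
    spe_program_load(ctx, prog);                   /* copy its image into the LS */
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);  /* blocks until the SPE stops */
    spe_image_close(prog);
    pthread_exit(NULL);
}

int main(void)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    pthread_t tid;

    pthread_create(&tid, NULL, spe_thread, ctx);   /* one POSIX thread per SPE context */
    pthread_join(tid, NULL);                       /* wait for the SPE to finish */
    spe_context_destroy(ctx);
    return 0;
}

/* hello_spu.c – SPU side, built separately with spu-gcc */
#include <stdio.h>

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    (void)argp; (void)envp;
    printf("Hello from SPE 0x%llx\n", speid);
    return 0;
}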


SPU Context Switching

➢ quiesce the SPE

➢ save privileged and problem state to the CSA (context save area)

➢ save low 16 KB of LS to the CSA

➢ load and start the SPE context-save sequence

➢ save GPRs and channel state to the CSA

➢ save the LSCSA (local store context save area) to the CSA

➢ save 240 KB of LS to the CSA

➢ harvest (reset) the SPE

➢ load and start the SPE context-restore sequence

➢ copy the LSCSA from the CSA to the LS

➢ restore 240 KB of LS from the CSA

➢ restore GPRs and channel state from the LSCSA

➢ restore privileged state from the CSA

➢ restore remaining problem state from the CSA

➢ restore 16 KB of LS from the CSA


CELL Programming Models / Environments

Gedae – commercial product


CELL Resources

➢ developerWorks CELL Resource Center http://www.ibm.com/developerworks/power/cell/

➢ Barcelona Supercomputing Center http://www.bsc.es/ (Computer Sciences – Linux on CELL)

➢ Power.org CELL Developer Corner http://www.power.org/resources/devcorner/cellcorner/

➢ The CELL Project at IBM Research http://www.research.ibm.com/cell/

➢ CELL BE at IBM alphaWorks http://www.alphaworks.ibm.com/topics/cell/

➢ GA Tech CELL Workshop http://www.cc.gatech.edu/~bader/CellProgramming.html

➢ ICL CELL Summit http://www.cs.utk.edu/~dongarra/cell2006/

➢ CellPerformance http://www.cellperformance.com/

➢ CELL Broadband Engine Programming Handbook

➢ CELL Broadband Engine Programming Tutorial

➢ SPE Runtime Management Library

➢ C/C++ Language Extensions for CELL Broadband Engine Architecture

➢ SPU Assembly Language Specification

➢ Synergistic Processor Unit Instruction Set Architecture