IBM Research
© 2008
Feeding the Multicore Beast: It’s All About the Data!
Michael Perrone
IBM Master Inventor
Mgr, Cell Solutions Dept.
mpp@us.ibm.com
Outline
History: Data challenge
Motivation for multicore
Implications for programmers
How Cell addresses these implications
Examples
• 2D/3D FFT – Medical Imaging, Petroleum, general HPC…
• Green’s Functions – Seismic Imaging (Petroleum)
• String Matching – Network Processing: DPI & Intrusion Detection
• Neural Networks – Finance
Chapter 1:
The Beast is Hungry!
The Hungry Beast
[Figure: Data (“food”) reaches the Processor (“beast”) through the Data Pipe]
Pipe too small = starved beast
Pipe big enough = well-fed beast
Pipe too big = wasted resources
If flops grow faster than pipe capacity…
… the beast gets hungrier!
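The pipe argument above can be sketched as a roofline-style bound. This is only an illustration: the function names, the 25 GB/s pipe, and the 4 flops per byte are assumptions chosen for the example, not figures from the talk.

```python
def attainable_gflops(peak_gflops, pipe_gb_per_s, flops_per_byte):
    """Roofline-style bound: the pipe caps compute when data cannot arrive fast enough."""
    return min(peak_gflops, pipe_gb_per_s * flops_per_byte)


def beast_status(peak_gflops, pipe_gb_per_s, flops_per_byte):
    """Classify the pipe relative to what the processor can consume."""
    needed_gb_per_s = peak_gflops / flops_per_byte  # bandwidth that keeps every flop busy
    if pipe_gb_per_s < needed_gb_per_s:
        return "starved"
    if pipe_gb_per_s == needed_gb_per_s:
        return "well-fed"
    return "wasted pipe"


# Hypothetical numbers, chosen only for illustration.
PIPE = 25.0        # GB/s delivered by the data pipe
INTENSITY = 4.0    # flops performed per byte moved

for peak in (50.0, 100.0, 200.0):  # flops grow while the pipe stays fixed
    print(peak, attainable_gflops(peak, PIPE, INTENSITY), beast_status(peak, PIPE, INTENSITY))
```

With the pipe held fixed, raising the peak moves the same workload from "wasted pipe" through "well-fed" to "starved", which is the sense in which the beast gets hungrier.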
Move the food closer
Example: Intel Tulsa
– Xeon MP 7100 series
– 65nm, 349mm2, 2 Cores
– 3.4 GHz @ 150W
– ~54.4 SP GFlops
– http://www.intel.com/products/processor/xeon/index.htm
Large cache on chip
– ~50% of area
– Keeps data close for efficient access
If the data is local, the beast is happy!
– True for many algorithms
What happens if the beast is still hungry?
[Figure: Data and Cache]
If the data set doesn’t fit in cache
– Cache misses
– Memory latency exposed
– Performance degraded
Several important application classes don’t fit
– Graph searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
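One way to see why exposed memory latency degrades performance is the standard average memory access time formula. The sketch below uses hypothetical latencies and miss rates; none of the numbers come from the talk.

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Every access pays the hit time; a miss also pays the trip to main memory."""
    return hit_time_ns + miss_rate * miss_penalty_ns


# Hypothetical latencies: 1 ns for a cache hit, 100 ns for a main memory access.
fits_in_cache = average_access_time(1.0, 0.01, 100.0)    # working set fits: 1% misses
overflows_cache = average_access_time(1.0, 0.50, 100.0)  # working set too large: 50% misses

print(fits_in_cache, overflows_cache)
```

Under these assumed numbers the average access becomes roughly 25 times slower once half the accesses miss, even though the processor itself is unchanged.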
Make the food bowl larger
[Figure: Data and Cache]
Cache size steadily increasing
Implications
– Chip real estate reserved for cache
– Less space on chip for computes
– More power required for fewer FLOPS
But…
– Important application working sets are growing faster
– Multicore even more demanding on cache than uni-core
Chapter 2:
The Beast Has Babies
Power Density – The fundamental problem
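[Figure: power density (W/cm2), log scale from 1 to 1000, versus process generation from 1.5 down to 0.07, for the i386, i486, Pentium®, Pentium Pro®, Pentium II®, and Pentium III®, with Hot Plate and Nuclear Reactor levels marked for comparison]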
Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32
What’s causing the problem?
Gate dielectric approaching a fundamental limit (a few atomic layers)
[Figure: Gate Stack, labeled 10S, Tox = 11 Å]
[Figure: Power Density (W/cm2), log scale from 0.001 to 1000, versus Gate Length (microns), from 1 down to 0.01, with 65 nm marked]
Power, signal jitter, etc...
Diminishing Returns on Frequency
In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.
[Figure: Clock Speed (MHz), log scale from 10^2 to 10^4, versus year, 1990 to 2010, with Frequency-Driven Design Points marked]
Power vs Performance Trade Offs
[Figure: Relative Power (0 to 5) versus Relative Performance, with points labeled 1, 1.45, 1.3, .85, and 1.7]
We need to adapt our algorithms to
get performance out of multicore
Implications of Multicore
There are more mouths to feed
– Data movement will take center stage
Complexity of cores will stop increasing
… and has started to decrease in some cases
Complexity increases will center around communication
Assumption
– Achieving a significant % of peak performance is important
Chapter 3:
The Proper Care and Feeding of Hungry Beasts
Cell/B.E. Processor: 200GFLOPS (SP) @ ~70W
Feeding the Cell Processor
8 SPEs each with
– LS
– MFC
– SXU
PPE
– OS functions
– Disk IO
– Network IO
[Figure: Cell/B.E. block diagram. The PPE (64-bit Power Architecture with VMX; PPU with PXU, L1, and L2) and eight SPEs (each an SPU with SXU, LS, and MFC) connect to the EIB (up to 96B/cycle). The MIC connects the EIB to Dual XDR™ memory and the BIC connects it to FlexIO™. Links in the diagram are labeled 16B/cycle, with one 32B/cycle link and one 16B/cycle (2x) link.]
Cell Approach: Feed the beast more efficiently
Explicitly “orchestrate” the data flow between main memory and each SPE’s local store
– Use SPE’s DMA engine to gather & scatter data between main memory and local store
– Enables detailed programmer control of data flow
• Get/Put data when & where you want it
• Hides latency: Simultaneous reads, writes & computes
– Avoids restrictive HW cache management
• Unlikely to determine optimal data flow
• Potentially very inefficient
– Allows more efficient use of the existing bandwidth
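The idea of overlapping reads with computes can be sketched as double buffering. This is a plain Python stand-in, not Cell SDK code: `dma_get`, `process_double_buffered`, and the single worker thread that plays the role of the DMA engine are all hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def dma_get(memory, start, size):
    """Stand-in for an asynchronous DMA get: copy one chunk out of 'main memory'."""
    return memory[start:start + size]


def process_double_buffered(memory, chunk_size, compute):
    """Overlap the fetch of chunk k+1 with the computation on chunk k."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma_engine:
        # Prime the first buffer before entering the loop.
        pending = dma_engine.submit(dma_get, memory, 0, chunk_size)
        for start in range(0, len(memory), chunk_size):
            current = pending.result()  # wait only for the buffer needed now
            next_start = start + chunk_size
            if next_start < len(memory):
                # Start the next transfer before computing on the current buffer.
                pending = dma_engine.submit(dma_get, memory, next_start, chunk_size)
            results.append(compute(current))  # runs while the next get is in flight
    return results


data = list(range(10))
print(process_double_buffered(data, 4, sum))  # chunks [0..3], [4..7], [8, 9]
```

The point of the structure is the ordering inside the loop: the next get is issued before the compute starts, so transfer time is hidden behind compute time whenever the compute takes at least as long as the transfer.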
BOTTOM LINE:
It’s all about the data!
Cell Comparison: ~4x the FLOPS @ ~½ the power
Both 65nm technology
[Figure: the two dies, shown to scale]
Memory Managing Processor vs. Traditional General Purpose Processor
[Figure labels: IBM, AMD, Intel, Cell BE]
Examples of Feeding Cell
2D and 3D FFTs
Seismic Imaging
String Matching
Neural Networks (function approximation)
Feeding FFTs to Cell
[Figure labels: Input Image, Buffer, Tile, Transposed Tile, Transposed Buffer, Transposed Image]
SIMDized data
DMAs double buffered
Pass 1: For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
Pass 2: For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
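The two-pass structure above can be sketched in a few lines: each pass runs 1D FFTs along rows and then transposes, so the second pass sees the original columns as contiguous rows. This is a minimal sketch that assumes power-of-two sizes and processes one row at a time; the DMA transfers, the tiling, and the four-at-a-time SIMD FFTs from the slide are not modeled.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def fft2d(image):
    """Two identical passes: 1D FFTs along rows, then transpose."""
    pass1 = transpose([fft(row) for row in image])  # rows done; columns now contiguous
    pass2 = transpose([fft(row) for row in pass1])  # columns done; layout restored
    return pass2


img = [[1, 2, 3, 4], [5, 6, 7, 8], [0, 1, 0, 1], [2, 0, 2, 0]]
spectrum = fft2d(img)
print(spectrum[0][0])  # the zero-frequency term equals the sum of all samples
```

Because both passes are the same "row FFT, then transpose" step, every memory access inside a pass is unit stride, which is what makes the buffers easy to stream.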
3D FFTs
Long stride thrashes cache
Cell DMA allows prefetch
[Figure: data envelope of a single element in an N x N x N volume, with strides of 1, N, and N^2 along the three axes]
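The stride problem can be made concrete with a small sketch. It assumes an N x N x N volume stored in row-major order, and `gather_pencil` is a hypothetical helper that collects one 1D line of the volume into a contiguous list, which is roughly the job a strided DMA gather would do before a 1D FFT.

```python
def strides(n):
    """Element strides of an n x n x n array stored in row-major (x, y, z) order."""
    return {"z": 1, "y": n, "x": n * n}


def gather_pencil(flat, n, axis, a, b):
    """Gather the n elements of one 1D line ("pencil") along the given axis.

    (a, b) are the fixed coordinates on the other two axes, listed in
    (x, y, z) order with the pencil's own axis removed.
    """
    s = strides(n)
    others = [k for k in ("x", "y", "z") if k != axis]
    base = a * s[others[0]] + b * s[others[1]]
    return [flat[base + i * s[axis]] for i in range(n)]


n = 3
volume = list(range(n ** 3))  # element (x, y, z) lives at index x*n*n + y*n + z
print(gather_pencil(volume, n, "z", 1, 2))  # stride 1
print(gather_pencil(volume, n, "y", 1, 2))  # stride n
print(gather_pencil(volume, n, "x", 1, 2))  # stride n*n
```

Only the z pencil touches neighboring addresses; the y and x pencils jump by N and N^2 elements per step, so for large N each element can land on a different cache line.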
Feeding Seismic Imaging to Cell
New G at each (x,y)
Radial symmetry of G reduces BW requirements
[Figure labels: Data, Green’s Function, (X,Y)]
$\sum_{i,j} D(x+i,\, y+j)\, G(x, y, i, j)$
Feeding Seismic Imaging to Cell
[Figure: Data distributed across SPE 0 through SPE 7]
Feeding Seismic Imaging to Cell
For each X
– Load next column of data
– Load next column of indices
– For each Y
• Load Green’s functions
• SIMDize Green’s functions
• Compute convolution at (X,Y)
– Cycle buffers
[Figure labels: H, 2R+1, R, Data buffer, Green’s Index buffer, (X,Y)]
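The loop on this slide can be sketched as a spatially varying convolution over a sliding window of columns. Everything named here is an assumption made for illustration: the function `convolve_varying`, the table `greens` of (2R+1) x (2R+1) weights, and the `index` array that selects a table entry per point. SIMDization and the radial-symmetry saving are not modeled.

```python
from collections import deque


def convolve_varying(data, index, greens, R):
    """Apply a different Green's function at each interior point (x, y).

    data[x][y]   input samples, stored column by column (x selects the column)
    index[x][y]  which entry of `greens` applies at (x, y)
    greens[k]    a (2R+1) x (2R+1) table of weights
    """
    width, height = len(data), len(data[0])
    # Preload all but one column; the loop appends the last one.
    window = deque(data[:2 * R], maxlen=2 * R + 1)
    out = {}
    for x in range(R, width - R):
        window.append(data[x + R])  # load next column of data; the oldest one drops out
        idx_col = index[x]          # load next column of indices
        for y in range(R, height - R):
            g = greens[idx_col[y]]  # load the Green's function for (x, y)
            out[(x, y)] = sum(
                window[i][y - R + j] * g[i][j]
                for i in range(2 * R + 1)
                for j in range(2 * R + 1)
            )
    return out


data = [[x * 4 + y for y in range(4)] for x in range(4)]
greens = [[[1] * 3 for _ in range(3)], [[0, 0, 0], [0, 1, 0], [0, 0, 0]]]
index = [[0] * 4 for _ in range(4)]
index[2][2] = 1  # use the second table at one point only
print(convolve_varying(data, index, greens, 1))
```

The bounded deque plays the role of the cycled buffers: only 2R+1 columns are resident at any time, and each step along X brings in exactly one new column.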
Feeding String Matching to Cell
Find (lots of) substrings in (long) string
Build graph of words & represent as DFA
Problem: Graph doesn’t fit in LS
Sample Word List:
“the”
“that”
“math”
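As an illustration of turning a word list into a DFA, the sketch below uses the Aho-Corasick construction with failure links folded into the transition table. The talk does not say which construction was used, so this choice is an assumption, and `build_dfa` and `scan` are hypothetical names.

```python
from collections import deque


def build_dfa(words):
    """Build a word-matching DFA: every (state, char) pair has exactly one next state."""
    alphabet = sorted(set("".join(words)))
    goto = [{}]      # trie edges per state
    found = [set()]  # words recognized on entering each state
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                found.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        found[s].add(w)

    fail = [0] * len(goto)
    dfa = [dict() for _ in goto]
    queue = deque()
    for ch in alphabet:
        nxt = goto[0].get(ch, 0)
        dfa[0][ch] = nxt
        if nxt:
            queue.append(nxt)
    while queue:  # breadth-first, so shallower states are complete before deeper ones
        s = queue.popleft()
        found[s] |= found[fail[s]]
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = dfa[fail[s]][ch]
                dfa[s][ch] = t
                queue.append(t)
            else:
                dfa[s][ch] = dfa[fail[s]][ch]
    return dfa, found


def scan(text, dfa, found):
    """Return (end_position, word) for every occurrence; one table lookup per character."""
    hits, s = [], 0
    for pos, ch in enumerate(text):
        s = dfa[s].get(ch, 0)  # characters outside the alphabet reset to the start state
        hits.extend((pos, w) for w in sorted(found[s]))
    return hits


dfa, found = build_dfa(["the", "that", "math"])
print(scan("mathat the", dfa, found))
```

Each input character costs one lookup in a table indexed by state, and the table grows with the word list, which is why a large dictionary stops fitting in a small local store.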
Hiding Main Memory Latency
Software Multithreading
Feeding Neural Networks to Cell
Neural net function F(X)
– RBF, MLP, KNN, etc.
If too big for LS, BW Bound
N Basis functions: dot product + nonlinearity
D Input dimensions
DxN Matrix of parameters
[Figure: network mapping input X to output F]
Convert BW Bound to Compute Bound
Split function over multiple SPEs
Avoids unnecessary memory traffic
Reduce compute time per SPE
Minimal merge overhead
[Figure: partial results from the SPEs feed a Merge step]
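Splitting the function across SPEs can be sketched as follows. The sketch assumes F(X) is a weighted sum of N basis functions, each a dot product followed by tanh; the output coefficients `coeffs`, the choice of tanh, and both function names are assumptions. The workers run one after another here, whereas on Cell each slice would run on its own SPE.

```python
import math


def basis_sum(x, weights, coeffs):
    """Sum of basis functions: each is a dot product followed by a nonlinearity."""
    return sum(
        c * math.tanh(sum(w_d * x_d for w_d, x_d in zip(w, x)))
        for w, c in zip(weights, coeffs)
    )


def split_evaluate(x, weights, coeffs, n_workers):
    """Give each worker a contiguous slice of the N basis functions, then merge."""
    n = len(weights)
    bounds = [n * k // n_workers for k in range(n_workers + 1)]
    partials = [
        # Each slice of parameters would stay resident in one worker's local store.
        basis_sum(x, weights[lo:hi], coeffs[lo:hi])
        for lo, hi in zip(bounds, bounds[1:])
    ]
    return sum(partials), partials  # the merge is one add per worker


x = [0.5, -1.0]                                        # D = 2 input dimensions
weights = [[0.1 * i, 0.2 * i - 0.3] for i in range(8)]  # N = 8 basis functions
coeffs = [1.0] * 8
total, partials = split_evaluate(x, weights, coeffs, 4)
print(total, partials)
```

Once each slice fits in local store, the parameters are loaded once and only the input X and one partial result per worker cross the memory interface, which is what moves the problem from bandwidth bound toward compute bound.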
Moral of the Story: It’s All About the Data!
The data problem is growing: multicore
Intelligent software prefetching
– Use DMA engines
– Don’t rely on HW prefetching
Efficient data management
– Multibuffering: Hide the latency!
– BW utilization: Make every byte count!
– SIMDization: Make every vector count!
– Problem/data partitioning: Make every core work!
– Software multithreading: Keep every core busy!
Backup
Abstract
Technological obstacles have prevented the microprocessor
industry from achieving increased performance through increased
chip clock speeds. In reaction to these restrictions, the industry
has chosen the multicore processor path. Multicore processors
promise tremendous GFLOPS performance but raise the challenge
of how one programs them. In this talk, I will discuss the motivation
for multicore, the implications for programmers, and how the
Cell/B.E. processor’s design addresses these challenges. As an
example, I will review one or two applications that highlight the
strengths of Cell.