A Near-Memory Processor (NMP) for Vector, Streaming, and Bit Manipulation Workloads
Mingliang Wei, Marc Snir, Josep Torrellas, and R. Brett Tremaine
University of Illinois and IBM
February 2005
Wei/Snir/Torrellas: Near Memory Processing 1
Motivation

• Memory system bandwidth is expensive
• Many scientific/multimedia apps
  – Are memory-bandwidth limited (caches hardly work)
  – Could use vectors
  – Manipulate bits
• Commodity processors are not optimized for this!
Approach

• Previous talk: NMP for prefetching
• This design: NMP to off-load computation from the main processor
  – Vector
  – Streaming
  – Bit manipulation
• Design is very simple
• Near Memory Processor (NMP):
  – Optimized for bandwidth (e.g., no caches)
  – Not necessarily physically closer to memory
Near Memory Processor

• NMP: simple multithreaded processor with vector/stream/bit support
  – High performance: tolerates high variability in:
    • Load/store latency of vector elements
    • Relative speed of communicating streams
  – Simple, compact design
  – Easy to program
  – Fairly general purpose
  – Can be on the main processor chip or closer to memory
• Uses a scratchpad: a flexible memory area to store stream buffers, vectors, and other data
Vectors, Streams, and Multithreading

• Vectors
  – Have parallelism
  – Caches work sub-optimally
• Streams
  – Have parallelism
  – Exploit producer/consumer parallelism
  – Space-multiplexing the hardware is inefficient
• Blocked multithreading
  – Tolerates variability in LD/ST latency of vector elements
  – Tolerates variability in relative speed of communicating streams
  – Simple implementation, efficient hardware use
Multithreading in the NMP
• Context switch when the processor expects a long idle time
  – Scratchpad miss
  – Stalled on synch (e.g., a fast producer stream kernel)
• Only a few contexts needed
• Do not save much state on context switch:
  – Scratchpad is not saved
System Architecture
NMP Architecture
• In-order core
Core Architecture
Scratchpad
• High-bandwidth local memory
• Addressed with virtual addresses
• Shared by all threads running on the NMP
• Not saved on context switch
• Very efficient synchronization: full/empty bit per byte
• May suffer page misses
• Pages are lazily paged out to memory
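The full/empty-bit semantics can be sketched in plain C. This is a hypothetical single-threaded model for illustration only: the struct and function names are invented, and where this sketch returns `false` the real NMP would stall the thread and context-switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one Scratchpad byte with its full/empty bit. */
typedef struct {
    uint8_t data;
    bool    full;   /* the per-byte full/empty bit */
} sp_byte;

/* Producer side: store a value and set the full bit. */
void sp_write(sp_byte *b, uint8_t v) {
    b->data = v;
    b->full = true;
}

/* Consumer side: read only if full, then clear the bit.
 * Returns false where the real hardware would block the thread. */
bool sp_try_read(sp_byte *b, uint8_t *v) {
    if (!b->full)
        return false;
    *v = b->data;
    b->full = false;
    return true;
}
```

A producer/consumer pair built on these bits needs no separate lock: the bit itself carries the synchronization.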
Bit Manipulation
• Has the Bit Matrix Register (BMR) of the Cray
• Typical instructions:
  – Bmm_load: load a 64x64 bit matrix
  – Bmm: bit-multiply a vector or scalar with the matrix in the BMR
  – Leadz
  – Popcnt
  – Sshift
  – Mix
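For reference, the semantics of two of these instructions can be written as portable C. The instruction names come from the slide; the bit-serial implementations below are only sketches of what the hardware computes, not how it computes it.

```c
#include <assert.h>
#include <stdint.h>

/* Popcnt: number of 1 bits in a 64-bit word. */
int popcnt64(uint64_t x) {
    int c = 0;
    while (x) {
        c += (int)(x & 1);
        x >>= 1;
    }
    return c;
}

/* Leadz: number of leading zero bits; 64 for an all-zero word. */
int leadz64(uint64_t x) {
    int n = 0;
    for (int i = 63; i >= 0; i--) {
        if ((x >> i) & 1)
            break;
        n++;
    }
    return n;
}
```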
Other Issues
• Interface with Main Processor (MP)
  – Start: MP stores into a memory-mapped location (no syscall)
  – End: NMP sets a memory flag; MP polls
• Exceptions in the NMP
  – Virtual memory exceptions:
    • Not precise, but restartable and handled in SW
    • Each Scratchpad byte has a bit that records exceptions
  – Vector and stream buffers in the Scratchpad cannot cross pages
Programming Environment
• NMP is attached to the main program via a system call:
    NMP_handler = NMP_connect (ObjectAddress)
    Status = NMP_disconnect (NMP_handler)
• NMP is invoked from the main program via asynchronous (user-space) method invocation
  – Processor stores parameters at the NMP "doorbell" and polls a flag:
    Status = Memthread_create (FunctionInvoked, Params, CompletionFlag, NMP_handler)
    Memthread_wait (CompletionFlag) or Memthread_poll (CompletionFlag)
  – NMP sets the completion flag:
    Memthread_end (CompletionFlag)
  – Fits X10 asynchronous expressions and futures
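The doorbell protocol can be sketched as single-threaded C. The struct layout, `nmp_service_doorbell()`, and `double_it()` are invented for illustration; on real hardware the doorbell is a memory-mapped location and the NMP runs concurrently with the MP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical doorbell record the MP fills in (Memthread_create). */
typedef struct {
    void (*fn)(void *);             /* FunctionInvoked */
    void *params;                   /* Params */
    volatile int *completion_flag;  /* CompletionFlag */
} nmp_doorbell;

static nmp_doorbell doorbell;

/* Memthread_create: MP stores parameters at the doorbell (no syscall). */
void memthread_create(void (*fn)(void *), void *params, volatile int *flag) {
    *flag = 0;
    doorbell.fn = fn;
    doorbell.params = params;
    doorbell.completion_flag = flag;
}

/* Memthread_poll: MP checks the completion flag. */
int memthread_poll(volatile int *flag) { return *flag; }

/* Stand-in for the NMP: run the queued kernel, then Memthread_end. */
void nmp_service_doorbell(void) {
    doorbell.fn(doorbell.params);
    *doorbell.completion_flag = 1;
}

/* Example kernel the MP might off-load. */
void double_it(void *p) { *(int *)p *= 2; }
```

Because the MP only stores and polls ordinary memory, the invocation stays entirely in user space, which is what makes the mechanism cheap enough to use per-kernel.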
Possible Programming Support
• Trivial: "hand-coded" libraries for special applications
• Easy: libraries separately compiled by a vectorizing compiler
• Research: compiler that splits code into a main-processor part and an NMP part
  – Can leverage previous work at IBM and UIUC (FlexRAM)
Architecture Evaluated
[Figure: evaluated system, with the NMP on the processor chip]
Simulation Parameters
Simulator Infrastructure
• Architecture simulator from UIUC (SESC)
• Models out-of-order processors and MP memory hierarchies
• 110 K lines of C++ source code
Applications

• Code length: 730 lines/app (average)
Mapping of Applications

[Figure: applications placed by spatial locality (x-axis, high to the right) and temporal locality (y-axis, high at the top): BMT, ConvEnc, PartRadio, 3DES, BT, Rgb2Yuv]
NMP Speedup Over Main Processor

[Figure: bar chart of NMP speedup over the main processor (0-20x) for Rgb2yuv, ConvEnc, BMT, BT, 3DES, PartRadio, and the geometric mean; bars annotated with the support used: V (vector), S (stream), B (bit manipulation)]
NMP Speedup Over Main Processor

[Figure: same speedup chart with ablated configurations per application: nmp, novec, nobit, nomt, none]
Results

• NMP is much faster than the main processor: geometric mean speedup of 4.9
• Vector support is key to the speedups in many vectorizable apps (Rgb2yuv, 3DES, and PartRadio)
• Bit manipulation support is key in some apps (BMT and BT)
• ConvEnc needs both vectorization and bit manipulation support
Adding the Stream App

[Figure: speedup chart (0-20x) extended with the STREAM kernels Copy, Scale, Add, and Triad, plus the geometric mean; bars annotated with the support used (V, S, B)]
Adding the Stream App

[Figure: same chart with ablated configurations (nmp, novec, nobit, nomt, none), including Copy, Scale, Add, and Triad]
Conclusions
• Selected one NMP design point:
  – Off-load vector, streaming, and bit manipulation ops
  – Simple/compact design
  – Fairly general-purpose core
• Evaluated the design assuming the NMP is on the processor chip
• High performance for high-bandwidth applications:
  – 4.9x faster than the main processor
  – Most cost-effective support in the NMP:
    • Vectors
    • Bit manipulation
Future Work
• Focusing on programmability:
  – Programming interface
  – Compiler support
• Interaction with the OS
• Step back: how far can we go with the PowerPC ISA (and small extensions)?
  – No vector compiler
  – Libraries that perform efficient bit manipulation
Near Memory Processing (NMP)
Mingliang Wei, Marc Snir, Josep Torrellas, and R. Brett Tremaine
University of Illinois and IBM
February 2005
RGB2YUV (Vector)

[Figure: the input image partitioned among Threads 1-4; one cell = 1 pixel]

• Input: 1000 x 200 pixels; output is the same size
• Conversion of each pixel is independent
• 4 threads, each processes a partition of the input
• Each thread processes 16 pixels in parallel
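A scalar reference for the per-pixel conversion is below; on the NMP each thread applies it to 16 pixels at once with vector instructions. The BT.601 coefficients are an assumption for the sketch, since the slide does not give the exact matrix used.

```c
#include <assert.h>
#include <math.h>

/* One pixel of RGB -> YUV, using assumed BT.601-style coefficients.
 * Each pixel is independent, which is why the kernel vectorizes and
 * partitions cleanly across threads. */
void rgb2yuv(double r, double g, double b,
             double *y, double *u, double *v) {
    *y =  0.299 * r + 0.587 * g + 0.114 * b;
    *u = -0.147 * r - 0.289 * g + 0.436 * b;
    *v =  0.615 * r - 0.515 * g - 0.100 * b;
}
```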
BMT (Bit manip)

[Figure: a 1024x1024-bit matrix viewed as tiles, labeled "A 64x64 bit tile"]

• Transpose 4 1024x1024 bit matrices
• 4 threads, each transposes 1 matrix
• Threads use:
  – Vector load to load data
  – Bmm_vec to transpose a tile
  – Vector store
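The result of transposing one tile can be illustrated on a smaller 8x8 bit tile packed as 8 bytes (bit 7-c of byte r holds element (r, c)). This naive loop only defines the result; Bmm_vec on the NMP performs the analogous 64x64 operation in hardware via the BMR.

```c
#include <assert.h>
#include <stdint.h>

/* Reference bit-matrix transpose of an 8x8 tile: out[i][j] = in[j][i]. */
void transpose8(const uint8_t in[8], uint8_t out[8]) {
    for (int i = 0; i < 8; i++) {
        uint8_t row = 0;
        for (int j = 0; j < 8; j++)
            row |= (uint8_t)(((in[j] >> (7 - i)) & 1) << (7 - j));
        out[i] = row;
    }
}
```

Transposing a full 1024x1024-bit matrix then reduces to transposing each tile and swapping tile positions, which is why vector load/store plus a tile-transpose instruction covers the whole kernel.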
3DES (Vector)

• Input: a set of 64-bit word values; output is same-size encrypted word values
• Vectorized the load/store and the computation
• Counter mode (more parallelism)
• 4 threads, each works on a partition of the input (like Rgb2yuv)
• Each thread processes 16 elements in parallel
• Bmm to perform the initial and final permutations (though a small fraction of total execution time)
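Counter mode is what exposes the parallelism: block i is encrypted independently as C_i = E(nonce + i) XOR P_i, so 16 blocks can be processed per vector step. In this sketch `toy_mix()` is an invented stand-in for the real 3DES block function, used only to keep the structure visible.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Stand-in block function (NOT 3DES): a cheap 64-bit mixer. */
static uint64_t toy_mix(uint64_t x) {
    x *= 0x9E3779B97F4A7C15ULL;
    return x ^ (x >> 31);
}

/* Counter-mode encryption: every iteration is independent, so the
 * loop vectorizes and partitions across threads directly. */
void ctr_crypt(const uint64_t *in, uint64_t *out, size_t n, uint64_t nonce) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] ^ toy_mix(nonce + i);
}
```

Since CTR mode is its own inverse, applying `ctr_crypt` twice with the same nonce recovers the original words.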
ConvEnc (Vector + stream + bit)

[Figure: Reader, Encoder, and Writer threads connected by stream buffers in the Scratchpad; input bit stream in, output bit stream out]

• Input: bit stream; output: 2x bit stream
• Encode 64 bytes at a time
• Sshift instruction to shift 64 bytes of data
• Mix instruction to interleave two blocks of bits
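A minimal rate-1/2 convolutional encoder shows why the output stream is twice the input: each input bit yields two parity bits. The generator polynomials (7, 5 octal, constraint length 3) are assumed for the sketch, since the slide does not name the code; this version works bit-at-a-time, whereas the NMP kernel processes 64 bytes per step using Sshift and Mix.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* XOR-parity of the set bits in v. */
static int parity(unsigned v) {
    int p = 0;
    while (v) { p ^= (int)(v & 1); v >>= 1; }
    return p;
}

/* Rate-1/2 encoder with assumed generators g0 = 7, g1 = 5 (octal).
 * in/out hold one bit per byte; returns the output length 2*n. */
size_t conv_encode(const uint8_t *in, size_t n, uint8_t *out) {
    const unsigned g0 = 07, g1 = 05, mask = 07;
    unsigned reg = 0;
    for (size_t i = 0; i < n; i++) {
        reg = ((reg << 1) | (in[i] & 1u)) & mask;
        out[2 * i]     = (uint8_t)parity(reg & g0);
        out[2 * i + 1] = (uint8_t)parity(reg & g1);
    }
    return 2 * n;
}
```

Interleaving the two parity streams into one output stream is exactly the job the Mix instruction does in bulk.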
BT (Stream + bit manip)

• Generate {Ai}
• {Bi} = take 5, drop 6, take 5, ... from {Ai}
• {Ci} = drop 5, take 5, drop 6, take 5, drop 6 from {Ai}
• {Di} = (Ci or not(Bi+1)) xor (not(Ci) or Bi-1)
• {Ei} = Di xor Di+37 xor Di+100
• Identify sequences of 0s in E that are longer than 100
• Three threads in the NMP: Generator, Splitter, Counter
• Bmm to split the {Ai}

[Figure: Generator, Splitter, and Counter threads connected by Streams A, B, C, and D in the Scratchpad]
PartRadio (Vector + stream)

• Input: a floating-point stream; output is a floating-point stream
• 3 threads (like ConvEnc):
  – Low-pass filter
  – Demodulator
  – Equalizer
• Each thread processes 16 elements in parallel (vector)
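The first pipeline stage can be sketched as a simple FIR low-pass filter over the float stream. The 4-tap moving average is an assumption (the slide does not specify the filter); on the NMP this stage runs as one thread, 16 elements per vector step, feeding the demodulator thread through a Scratchpad stream buffer.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Assumed 4-tap moving-average low-pass stage: out[i] is the mean of
 * the current sample and up to three previous ones. Each output only
 * reads earlier inputs, so the stage streams naturally. */
void lowpass(const float *in, size_t n, float *out) {
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t t = 0; t < 4 && t <= i; t++)
            acc += in[i - t];
        out[i] = acc / 4.0f;
    }
}
```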
Data Movement
[Figure: data movement among L2, L3, memory, and the Scratchpad]