A Near-Memory Processor (NMP) for Vector, Streaming, and Bit Manipulation Workloads
Mingliang Wei, Marc Snir, Josep Torrellas, and R. Brett Tremaine
University of Illinois and IBM
February 2005
Wei/Snir/Torrellas: Near Memory Processing 1
Motivation

• Memory system bandwidth is expensive
• Many scientific/multimedia apps
  – Are memory-bandwidth limited (caches hardly work)
  – Could use vectors
  – Manipulate bits
• Commodity processors are not optimized for this!
Approach

• Previous talk: NMP for prefetching
• This design: NMP to off-load computation from the main processor
  – Vector
  – Streaming
  – Bit manipulation
• Design is very simple
• Near Memory Processor (NMP):
  – Optimized for bandwidth (e.g., no caches)
  – Not necessarily physically closer to memory
Near Memory Processor

• NMP: simple multithreaded processor with vector/stream/bit support
  – High performance: tolerates high variability in:
    • Load/store latency of vector elements
    • Relative speed of communicating streams
  – Simple, compact design
  – Easy to program
  – Fairly general purpose
  – Can be on the main processor chip or closer to memory
• Uses a scratchpad: a flexible memory area to store stream buffers, vectors, and other data
Vectors, Streams, and Multithreading

• Vectors
  – Have parallelism
  – Caches work sub-optimally
• Streams
  – Have parallelism
  – Exploit producer/consumer parallelism
  – Space-multiplexing the hardware is inefficient
• Blocked multithreading
  – Tolerates variability in LD/ST latency of vector elements
  – Tolerates variability in relative speed of communicating streams
  – Simple implementation, efficient hardware use
Multithreading in the NMP
• Context switch when the processor expects a long idle time
  – Scratchpad miss
  – Stalled on synch (e.g., a fast producer stream kernel)
• Only a few contexts needed
• Do not save much state on context switch:
  – Scratchpad is not saved
System Architecture
NMP Architecture
• In-order core
Core Architecture
Scratchpad
• High-bandwidth local memory
• Addressed with virtual addresses
• Shared by all threads running on the NMP
• Not saved on context switch
• Very efficient synchronization: full/empty bit per byte
• May suffer page misses
• Pages are lazily paged out to memory
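The full/empty-bit semantics can be sketched in plain C. This is a hypothetical single-threaded model for illustration only: the struct and function names are invented, and where this sketch returns `false` the real NMP would stall the thread and context-switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one Scratchpad byte with its full/empty bit. */
typedef struct {
    uint8_t data;
    bool    full;   /* the per-byte full/empty bit */
} sp_byte;

/* Producer side: store a value and set the full bit. */
void sp_write(sp_byte *b, uint8_t v) {
    b->data = v;
    b->full = true;
}

/* Consumer side: read only if full, then clear the bit.
 * Returns false where the real hardware would block the thread. */
bool sp_try_read(sp_byte *b, uint8_t *v) {
    if (!b->full)
        return false;
    *v = b->data;
    b->full = false;
    return true;
}
```

A producer/consumer pair built on these bits needs no separate lock: the bit itself carries the synchronization.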
Bit Manipulation
• Has the Bit Matrix Register (BMR) of the Cray
• Typical instructions:
  – Bmm_load: load a 64x64 bit matrix
  – Bmm: bit-multiply a vector or scalar with the matrix in the BMR
  – Leadz
  – Popcnt
  – Sshift
  – Mix
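For reference, the semantics of two of these instructions can be written as portable C. The instruction names come from the slide; the bit-serial implementations below are only sketches of what the hardware computes, not how it computes it.

```c
#include <assert.h>
#include <stdint.h>

/* Popcnt: number of 1 bits in a 64-bit word. */
int popcnt64(uint64_t x) {
    int c = 0;
    while (x) {
        c += (int)(x & 1);
        x >>= 1;
    }
    return c;
}

/* Leadz: number of leading zero bits; 64 for an all-zero word. */
int leadz64(uint64_t x) {
    int n = 0;
    for (int i = 63; i >= 0; i--) {
        if ((x >> i) & 1)
            break;
        n++;
    }
    return n;
}
```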
Other Issues
• Interface with Main Processor (MP)
  – Start: MP stores into a memory-mapped location (no syscall)
  – End: NMP sets a memory flag; MP polls
• Exceptions in the NMP
  – Virtual memory exceptions:
    • Not precise, but restartable and handled in SW
    • Each Scratchpad byte has a bit that records exceptions
  – Vector and stream buffers in the Scratchpad cannot cross pages
Programming Environment
• NMP is attached to the main program via a system call:
    NMP_handler = NMP_connect (ObjectAddress)
    Status = NMP_disconnect (NMP_handler)
• NMP is invoked from the main program via asynchronous (user-space) method invocation
  – Processor stores parameters at the NMP "doorbell" and polls a flag:
    Status = Memthread_create (FunctionInvoked, Params, CompletionFlag, NMP_handler)
    Memthread_wait (CompletionFlag) or Memthread_poll (CompletionFlag)
  – NMP sets the completion flag:
    Memthread_end (CompletionFlag)
  – Fits X10 asynchronous expressions and futures
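The doorbell protocol can be sketched as single-threaded C. The struct layout, `nmp_service_doorbell()`, and `double_it()` are invented for illustration; on real hardware the doorbell is a memory-mapped location and the NMP runs concurrently with the MP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical doorbell record the MP fills in (Memthread_create). */
typedef struct {
    void (*fn)(void *);             /* FunctionInvoked */
    void *params;                   /* Params */
    volatile int *completion_flag;  /* CompletionFlag */
} nmp_doorbell;

static nmp_doorbell doorbell;

/* Memthread_create: MP stores parameters at the doorbell (no syscall). */
void memthread_create(void (*fn)(void *), void *params, volatile int *flag) {
    *flag = 0;
    doorbell.fn = fn;
    doorbell.params = params;
    doorbell.completion_flag = flag;
}

/* Memthread_poll: MP checks the completion flag. */
int memthread_poll(volatile int *flag) { return *flag; }

/* Stand-in for the NMP: run the queued kernel, then Memthread_end. */
void nmp_service_doorbell(void) {
    doorbell.fn(doorbell.params);
    *doorbell.completion_flag = 1;
}

/* Example kernel the MP might off-load. */
void double_it(void *p) { *(int *)p *= 2; }
```

Because the MP only stores and polls ordinary memory, the invocation stays entirely in user space, which is what makes the mechanism cheap enough to use per-kernel.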
Possible Programming Support
• Trivial: "hand-coded" libraries for special applications
• Easy: libraries separately compiled by a vectorizing compiler
• Research: compiler that splits code into a main-processor part and an NMP part
  – Can leverage previous work at IBM and UIUC (FlexRAM)
Architecture Evaluated
[Figure: evaluated system, with the NMP on the processor chip]
Simulation Parameters
Simulator Infrastructure
• Architecture simulator from UIUC (SESC)
• Models out-of-order processors and MP memory hierarchies
• 110 K lines of C++ source code
Applications

• Code length: 730 lines/app (average)
Mapping of Applications

[Figure: applications placed by spatial locality (x-axis, high to the right) and temporal locality (y-axis, high at the top): BMT, ConvEnc, PartRadio, 3DES, BT, Rgb2Yuv]
NMP Speedup Over Main Processor

[Figure: bar chart of NMP speedup over the main processor (0-20x) for Rgb2yuv, ConvEnc, BMT, BT, 3DES, PartRadio, and the geometric mean; bars annotated with the support used: V (vector), S (stream), B (bit manipulation)]
NMP Speedup Over Main Processor

[Figure: same speedup chart with ablated configurations per application: nmp, novec, nobit, nomt, none]
Results

• NMP is much faster than the main processor: geometric mean speedup of 4.9
• Vector support is key to the speedups in many vectorizable apps (Rgb2yuv, 3DES, and PartRadio)
• Bit manipulation support is key in some apps (BMT and BT)
• ConvEnc needs both vectorization and bit manipulation support
Adding the Stream App

[Figure: speedup chart (0-20x) extended with the STREAM kernels Copy, Scale, Add, and Triad, plus the geometric mean; bars annotated with the support used (V, S, B)]
Adding the Stream App

[Figure: same chart with ablated configurations (nmp, novec, nobit, nomt, none), including Copy, Scale, Add, and Triad]
Conclusions
• Selected one NMP design point:
  – Off-load vector, streaming, and bit manipulation ops
  – Simple/compact design
  – Fairly general-purpose core
• Evaluated the design assuming the NMP is on the processor chip
• High performance for high-bandwidth applications:
  – 4.9x faster than the main processor
  – Most cost-effective support in the NMP:
    • Vectors
    • Bit manipulation
Future Work
• Focusing on programmability:
  – Programming interface
  – Compiler support
• Interaction with the OS
• Step back: how far can we go with the PowerPC ISA (and small extensions)?
  – No vector compiler
  – Libraries that perform efficient bit manipulation
Near Memory Processing (NMP)
Mingliang Wei, Marc Snir, Josep Torrellas, and R. Brett Tremaine
University of Illinois and IBM
February 2005
RGB2YUV (Vector)

[Figure: the input image partitioned among Threads 1-4; one cell = 1 pixel]

• Input: 1000 x 200 pixels; output is the same size
• Conversion of each pixel is independent
• 4 threads, each processes a partition of the input
• Each thread processes 16 pixels in parallel
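A scalar reference for the per-pixel conversion is below; on the NMP each thread applies it to 16 pixels at once with vector instructions. The BT.601 coefficients are an assumption for the sketch, since the slide does not give the exact matrix used.

```c
#include <assert.h>
#include <math.h>

/* One pixel of RGB -> YUV, using assumed BT.601-style coefficients.
 * Each pixel is independent, which is why the kernel vectorizes and
 * partitions cleanly across threads. */
void rgb2yuv(double r, double g, double b,
             double *y, double *u, double *v) {
    *y =  0.299 * r + 0.587 * g + 0.114 * b;
    *u = -0.147 * r - 0.289 * g + 0.436 * b;
    *v =  0.615 * r - 0.515 * g - 0.100 * b;
}
```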
BMT (Bit manip)

[Figure: a 1024x1024-bit matrix viewed as tiles, labeled "A 64x64 bit tile"]

• Transpose 4 1024x1024 bit matrices
• 4 threads, each transposes 1 matrix
• Threads use:
  – Vector load to load data
  – Bmm_vec to transpose a tile
  – Vector store
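The result of transposing one tile can be illustrated on a smaller 8x8 bit tile packed as 8 bytes (bit 7-c of byte r holds element (r, c)). This naive loop only defines the result; Bmm_vec on the NMP performs the analogous 64x64 operation in hardware via the BMR.

```c
#include <assert.h>
#include <stdint.h>

/* Reference bit-matrix transpose of an 8x8 tile: out[i][j] = in[j][i]. */
void transpose8(const uint8_t in[8], uint8_t out[8]) {
    for (int i = 0; i < 8; i++) {
        uint8_t row = 0;
        for (int j = 0; j < 8; j++)
            row |= (uint8_t)(((in[j] >> (7 - i)) & 1) << (7 - j));
        out[i] = row;
    }
}
```

Transposing a full 1024x1024-bit matrix then reduces to transposing each tile and swapping tile positions, which is why vector load/store plus a tile-transpose instruction covers the whole kernel.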
3DES (Vector)

• Input: a set of 64-bit word values; output is same-size encrypted word values
• Vectorized the load/store and the computation
• Counter mode (more parallelism)
• 4 threads, each works on a partition of the input (like Rgb2yuv)
• Each thread processes 16 elements in parallel
• Bmm to perform the initial and final permutations (though a small fraction of total execution time)
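Counter mode is what exposes the parallelism: block i is encrypted independently as C_i = E(nonce + i) XOR P_i, so 16 blocks can be processed per vector step. In this sketch `toy_mix()` is an invented stand-in for the real 3DES block function, used only to keep the structure visible.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Stand-in block function (NOT 3DES): a cheap 64-bit mixer. */
static uint64_t toy_mix(uint64_t x) {
    x *= 0x9E3779B97F4A7C15ULL;
    return x ^ (x >> 31);
}

/* Counter-mode encryption: every iteration is independent, so the
 * loop vectorizes and partitions across threads directly. */
void ctr_crypt(const uint64_t *in, uint64_t *out, size_t n, uint64_t nonce) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] ^ toy_mix(nonce + i);
}
```

Since CTR mode is its own inverse, applying `ctr_crypt` twice with the same nonce recovers the original words.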
ConvEnc (Vector + stream + bit)

[Figure: Reader, Encoder, and Writer threads connected by stream buffers in the Scratchpad; input bit stream in, output bit stream out]

• Input: bit stream; output: 2x bit stream
• Encode 64 bytes at a time
• Sshift instruction to shift 64 bytes of data
• Mix instruction to interleave two blocks of bits
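A minimal rate-1/2 convolutional encoder shows why the output stream is twice the input: each input bit yields two parity bits. The generator polynomials (7, 5 octal, constraint length 3) are assumed for the sketch, since the slide does not name the code; this version works bit-at-a-time, whereas the NMP kernel processes 64 bytes per step using Sshift and Mix.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* XOR-parity of the set bits in v. */
static int parity(unsigned v) {
    int p = 0;
    while (v) { p ^= (int)(v & 1); v >>= 1; }
    return p;
}

/* Rate-1/2 encoder with assumed generators g0 = 7, g1 = 5 (octal).
 * in/out hold one bit per byte; returns the output length 2*n. */
size_t conv_encode(const uint8_t *in, size_t n, uint8_t *out) {
    const unsigned g0 = 07, g1 = 05, mask = 07;
    unsigned reg = 0;
    for (size_t i = 0; i < n; i++) {
        reg = ((reg << 1) | (in[i] & 1u)) & mask;
        out[2 * i]     = (uint8_t)parity(reg & g0);
        out[2 * i + 1] = (uint8_t)parity(reg & g1);
    }
    return 2 * n;
}
```

Interleaving the two parity streams into one output stream is exactly the job the Mix instruction does in bulk.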
BT (Stream + bit manip)

• Generate {Ai}
• {Bi} = take 5, drop 6, take 5, ... from {Ai}
• {Ci} = drop 5, take 5, drop 6, take 5, drop 6 from {Ai}
• {Di} = (Ci or not(Bi+1)) xor (not(Ci) or Bi-1)
• {Ei} = Di xor Di+37 xor Di+100
• Identify sequences of 0s in E that are longer than 100
• Three threads in the NMP: Generator, Splitter, Counter
• Bmm to split the {Ai}

[Figure: Generator, Splitter, and Counter threads connected by Streams A, B, C, and D in the Scratchpad]
PartRadio (Vector + stream)

• Input: a floating-point stream; output is a floating-point stream
• 3 threads (like ConvEnc):
  – Low-pass filter
  – Demodulator
  – Equalizer
• Each thread processes 16 elements in parallel (vector)
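The first pipeline stage can be sketched as a simple FIR low-pass filter over the float stream. The 4-tap moving average is an assumption (the slide does not specify the filter); on the NMP this stage runs as one thread, 16 elements per vector step, feeding the demodulator thread through a Scratchpad stream buffer.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Assumed 4-tap moving-average low-pass stage: out[i] is the mean of
 * the current sample and up to three previous ones. Each output only
 * reads earlier inputs, so the stage streams naturally. */
void lowpass(const float *in, size_t n, float *out) {
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t t = 0; t < 4 && t <= i; t++)
            acc += in[i - t];
        out[i] = acc / 4.0f;
    }
}
```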
Data Movement
[Figure: data movement among L2, L3, memory, and the Scratchpad]