Key Ideas
• Goal
– Increase memory bandwidth 100X using in-memory processors
• First smart-memory PIM device that is
– Capable of executing independent threads of control
– Designed to support in-memory virtual addressing
• Target applications
– Image processing and multimedia (streaming)
– Irregular memory accesses (sparse matrices, graphs, and pointers)
• Evolutionary application migration path
– PIMs also perform standard memory accesses
– System supports familiar parallel programming paradigms
Architecture
[Figure: System Architecture. Labels: PIM interconnect; eight PIMs; processor memory bus; host processor (Itanium-2).]
[Figure: PIM Node Architecture. Labels: DRAM memory; memory control & arbiter; node main data bus; parcel buffer (“PBUF”) / host “DRAM” interface with header registers and data registers; ICache; instruction pipeline; scalar datapath; scalar register file; WideWord datapath; WideWord register file (256b); datapath control; inter-datapath register data; node memory requests; host memory requests; ICache memory requests; bus-width labels 32b and 256b.]
PIMs in Action, Delivering Memory Bandwidth
Tim Barrett, Spundun Bhatt, Jacqueline Chame, Jeff Draper, Mary Hall
USC Viterbi School of Engineering
System Prototype
[Figure: PIM chip (2nd gen.), BGA top and bottom views. Labels: SRAM (two blocks); node processing logic, Pbuf, DDR SDRAM interface, PiRC.]
[Figure: PIM DIMMs in IA64 host.]
• PIM chip (2nd gen.): TSMC 0.18 µm technology
• System includes four boards with eight PIM chips
PIM Node Compiler Technology
[Figure: compiler flow. Labels: C/Fortran program; MIT-SLP; Pre-existing ISI Extensions; Superword Locality Optimizations; Compiler-Controlled Caching; Page Mode Memory Accesses; SLP in Presence of Control Flow; Superword instruction extended C program; Superword-Extended GCC; Macintosh G4 executable; DIVA executable.]
Overview of Implementation
* Insertion of custom PIMs in the memory space of a commodity platform
- Itanium-2 HP zx6000
- DDRAM interface
* System software
- Linux 2.4 & 2.6 device driver for PIM
- Compiler technology
* Challenges related to reliability and bandwidth features of commodity memory systems
- ECC
- ChipSpare (ChipKill)
- Address line interleaving
Measurements
Comparison: Itanium-2 vs. PIM

System     Clock    Transistors          Power   CPU Info                           Area
Itanium-2  900 MHz  221M                 ~100 W  EPIC, 6-way                        421 mm²
1 PIM      140 MHz  56.6M (55M memory)   ~1 W    single-issue, in-order, pipelined  121 mm²
8 PIMs     140 MHz  453M (440M memory)   ~8 W    single-issue, in-order, pipelined  8 x 121 mm²
[Figure: StreamAdd Performance. Execution time in sec vs. data set size in # elts.]
[Figure: StreamAdd Data Layout Sensitivity. Execution time in sec vs. offset between arrays in # elts.]
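The poster does not list the StreamAdd source. As a rough sketch only, assuming StreamAdd is the usual element-wise streaming add c[i] = a[i] + b[i], and assuming the "offset between arrays" is the distance in elements between the starts of consecutive arrays in one buffer, the kernel and layout might look like the following. The names `Arrays`, `place_arrays`, and `stream_add`, and the 32-bit element type, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the three array base pointers. */
typedef struct {
    uint32_t *a;
    uint32_t *b;
    uint32_t *c;
} Arrays;

/* Place a, b and c in one buffer, `offset` elements apart.
 * The caller must supply at least 2 * offset + n elements,
 * with offset >= n so the arrays do not overlap. */
Arrays place_arrays(uint32_t *buf, size_t offset)
{
    assert(buf != NULL);
    Arrays arr = { buf, buf + offset, buf + 2 * offset };
    return arr;
}

/* Assumed StreamAdd kernel: one pass over three unit-stride streams. */
void stream_add(uint32_t *c, const uint32_t *a, const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Sweeping `offset` while holding `n` fixed is one way to reproduce a layout-sensitivity experiment of this kind, since the offset determines how the three streams map onto memory banks and pages.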
This material is based on research sponsored by AFRL and NSA under agreement number FA8750-04-1-0265. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL and NSA or the U.S. Government.
Current Work
PIMs for Knowledge Discovery, in collaboration with Hans Chalupsky & Jafar Adibi, USC ISI

Objective
• Evaluate link discovery (LD) algorithms on Godiva H/W

Hypothesis
• LD algorithms are data-intensive and highly parallel
• Largely read-only data
• Irregular memory accesses, poor cache performance
• PIM technology would yield performance improvement

Expected Results
• Parallel PIM implementations of LD computations
• Performance comparisons with Itanium-2 host
• Analysis of software/hardware scalability requirements
• Analysis of programming complexity

KR&R
• PowerLoom logic-based KR&R system

Target LD Algorithms
• Properties
– Sparse graph algorithms
– Read-only data
– Some temporal/spatial locality after partitioning
• Examples
– Mutual information
– Graph clustering
– Model-based and mixture model methods such as HMMs

LD Challenges & Solution Approach
• Challenges
– Data: incomplete data; noise, corruption, uncertainty; unaligned entities or groups
– Scale: very large data volume; connectivity curse
– Representation: relational knowledge; rich link vocabulary (natural language extraction); complex and dynamic domain knowledge; meta-knowledge (“interestingness”, etc.); multiple hypotheses; threat/non-threat patterns
• Solution approaches
– Statistical methods: rarity analysis, mutual information, HHMM
– RDBMS, parallelism, focused search
– KR&R + partial logical inference
Random Access: Host Alone & Host w/ PIMs
[Panels: Original Code; Host and PIM Code; Host and PIM Algorithm Overview]
// Host-only RandomAccess
uInt64 Table[TABLESIZE];
uInt64 ran;

// initialize main table
for (i = 0; i < TableSize; i++)
    Table[i] = i;

// perform updates
ran = 1;
for (i = 0; i < NUpdates; i++) {
    // ran is unsigned, so test its top bit rather than comparing with 0
    ran = (ran << 1) ^ ((ran >> 63) ? POLY : 0);
    Table[ran & (TableSize - 1)] ^= ran;
}

// Host code for Host-and-PIM RandomAccess
// initialize main table
for (i = 0; i < TableSize; i++)
    Table[i] = i;

ran = 1;
for (i = 0; i < NUpdates; i++) {
    // ran is unsigned, so test its top bit rather than comparing with 0
    ran = (ran << 1) ^ ((ran >> 63) ? POLY : 0);
    offset = ran & (TableSize - 1);
    parcel.command = UPDATE;
    parcel.payload[0] = ran;
    parcel.payload[1] = offset;
    SendParcel(&Table[offset], parcel);
}

// tell every PIM that no more updates will arrive
parcel.command = DONE;
for (pim = 0; pim < NumPims; pim++)
    SendParcel(&done[pim], parcel);

// PIM code for Host-and-PIM RandomAccess
done = FALSE;
while (!done) {
    // check for parcels from host processor (non blocking)
    RecvParcel(hostParcelBuffer, recvStatus, parcel);
    if (recvStatus == TRUE) {
        if (parcel.command == UPDATE) {
            ran = parcel.payload[0];
            offset = parcel.payload[1];
            Table[offset] ^= ran;   // local memory access
        } else if (parcel.command == DONE) {
            done = TRUE;
        }
    }
}
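
The host and PIM listings depend on `SendParcel`, `RecvParcel`, and the PIM hardware, so they cannot be run as shown. The sketch below is a single-process stand-in, not the authors' implementation: each PIM is modelled as a contiguous slice of one table, and "sending" a parcel is a direct function call. The slice-per-PIM partitioning, the table size, the PIM count, the `POLY` value (taken here from the HPCC RandomAccess benchmark), and every function name are assumptions made for illustration. It checks that routing updates through parcels yields the same table as the host-only loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY 0x0000000000000007ULL /* assumed value, as in HPCC RandomAccess */

enum { TABLE_SIZE = 1024, NUM_PIMS = 8, SLICE = TABLE_SIZE / NUM_PIMS };

typedef enum { UPDATE, DONE } Command;

typedef struct {
    Command  command;
    uint64_t payload[2]; /* [0] = ran, [1] = offset inside the PIM's slice */
} Parcel;

/* Advance the pseudo-random stream; the top-bit test stands in for "< 0". */
static uint64_t next_ran(uint64_t ran)
{
    return (ran << 1) ^ ((ran >> 63) ? POLY : 0);
}

static void init_table(uint64_t *table)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        table[i] = i;
}

/* Reference: the host performs every update itself. */
void host_only(uint64_t *table, size_t n_updates)
{
    uint64_t ran = 1;
    init_table(table);
    for (size_t i = 0; i < n_updates; i++) {
        ran = next_ran(ran);
        table[ran & (TABLE_SIZE - 1)] ^= ran;
    }
}

/* Stand-in for one iteration of the PIM loop: apply a parcel to local memory.
 * Returns 1 once DONE has been received, 0 otherwise. */
static int pim_handle(uint64_t *local, const Parcel *p)
{
    if (p->command == UPDATE) {
        assert(p->payload[1] < (uint64_t)SLICE);
        local[p->payload[1]] ^= p->payload[0];
        return 0;
    }
    return p->command == DONE;
}

/* Host side: route each update to the PIM that owns the address.
 * Returns the number of PIMs that acknowledged DONE. */
int host_with_pims(uint64_t *table, size_t n_updates)
{
    uint64_t ran = 1;
    int finished = 0;
    Parcel parcel;

    init_table(table);
    for (size_t i = 0; i < n_updates; i++) {
        ran = next_ran(ran);
        uint64_t offset = ran & (TABLE_SIZE - 1);
        parcel.command = UPDATE;
        parcel.payload[0] = ran;
        parcel.payload[1] = offset % SLICE;
        pim_handle(table + (offset / SLICE) * SLICE, &parcel);
    }
    parcel.command = DONE;
    for (int pim = 0; pim < NUM_PIMS; pim++)
        finished += pim_handle(table + (size_t)pim * SLICE, &parcel);
    return finished;
}

/* Returns 1 if both tables hold identical contents. */
int tables_equal(const uint64_t *x, const uint64_t *y)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

Calling `host_only` and `host_with_pims` on two separate tables with the same update count and comparing them with `tables_equal` gives a quick equivalence check. Because every update to a given address goes to a single PIM and XOR is commutative, the result does not depend on how updates to different PIMs are interleaved.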
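
[Figure: Host and PIM Algorithm Overview]
HOST
• Generate next update pair <ran, offset>
• Bucket update for PIM according to address
• If bucket full, send parcel
Buckets: b0, b1, ..., bn-1 (example bucket contents: ran3, ran1, ran0, ran2)
PIM 0, PIM 1, ..., PIM n-1 (each with a parcel buffer and memory)
• Recv parcel
• Perform updates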
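
The bucketing steps in the overview can be sketched as follows. This is a minimal illustration under stated assumptions, not the poster's code: PIM `p` is assumed to own the contiguous address range `[p * SLICE, (p + 1) * SLICE)`, a bucket holds a fixed number of updates, and delivering a parcel is simulated by applying the bucket's updates directly to the table. `Host`, `Bucket`, `Update`, `host_init`, `host_update`, `host_flush`, and all constants are hypothetical names and values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_PIMS = 4, SLICE = 256, BUCKET_CAP = 8 };

typedef struct {
    uint64_t ran;
    uint64_t offset; /* global table index */
} Update;

typedef struct {
    Update items[BUCKET_CAP];
    size_t count;
} Bucket;

typedef struct {
    uint64_t *table;            /* NUM_PIMS * SLICE elements */
    Bucket    buckets[NUM_PIMS];
    size_t    parcels_sent;
} Host;

void host_init(Host *h, uint64_t *table)
{
    h->table = table;
    h->parcels_sent = 0;
    for (size_t p = 0; p < NUM_PIMS; p++)
        h->buckets[p].count = 0;
}

/* Simulated parcel delivery: the owning PIM applies every update in the
 * bucket to its local memory, then the bucket is emptied. */
static void send_parcel(Host *h, size_t pim)
{
    Bucket *b = &h->buckets[pim];
    for (size_t i = 0; i < b->count; i++)
        h->table[b->items[i].offset] ^= b->items[i].ran;
    b->count = 0;
    h->parcels_sent++;
}

/* Bucket one update by destination PIM; send the bucket when it fills. */
void host_update(Host *h, uint64_t ran, uint64_t offset)
{
    size_t pim = (size_t)(offset / SLICE);
    assert(pim < NUM_PIMS);
    Bucket *b = &h->buckets[pim];
    b->items[b->count].ran = ran;
    b->items[b->count].offset = offset;
    b->count++;
    if (b->count == BUCKET_CAP)
        send_parcel(h, pim);
}

/* Send whatever is left in partially filled buckets. */
void host_flush(Host *h)
{
    for (size_t p = 0; p < NUM_PIMS; p++)
        if (h->buckets[p].count > 0)
            send_parcel(h, p);
}
```

Batching amortizes the per-parcel cost over `BUCKET_CAP` updates, at the price of a final `host_flush` to drain partially filled buckets before the host signals completion.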