Key Ideas

• Goal
  – Increase memory bandwidth 100X using in-memory processors

• First smart-memory PIM device that is
  – Capable of executing independent threads of control
  – Designed to support in-memory virtual addressing

• Target applications
  – Image processing and multimedia (streaming)
  – Irregular memory accesses (sparse matrices, graphs and pointers)

• Evolutionary application migration path
  – PIMs also perform standard memory accesses
  – System supports familiar parallel programming paradigms

Architecture

System Architecture

[Figure: block diagram. A host processor (Itanium-2) sits on the processor memory bus with eight PIMs; the PIMs are also connected to a PIM interconnect.]

PIM Node Architecture

[Figure: block diagram. Labeled blocks: DRAM memory; memory control & arbiter; node main data bus; ICache; instruction pipeline; scalar datapath and scalar register file; WideWord datapath and WideWord register file (256b); datapath control; inter-datapath register data; parcel buffer ("PBUF") with header registers and data registers; host "DRAM" interface. Request paths: node memory requests, host memory requests, ICache memory requests. Other labels: CTL, Instr., 32b, 256b.]

PIMs in Action, Delivering Memory Bandwidth

Tim Barrett, Spundun Bhatt, Jacqueline Chame, Jeff Draper, Mary Hall

USC Viterbi School of Engineering

System Prototype

PIM Chip (2nd Gen.)

• TSMC 0.18 µm technology
• [Figure: BGA top and bottom views]
• [Figure: chip layout. Labels: SRAM; Node Processing Logic, Pbuf, DDR SDRAM Interface, PiRC]

PIM DIMMs in IA64 Host

• System includes four boards with eight PIM chips

PIM Node Compiler Technology

[Figure: compiler flow diagram. Labels: C/Fortran program; MIT-SLP; Superword instruction extended C program; Superword-Extended GCC; Macintosh G4 executable; DIVA executable. Legend: Pre-existing; ISI Extensions.]

• Superword Locality Optimizations
• Compiler-Controlled Caching
• Page Mode Memory Accesses
• SLP in Presence of Control Flow
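To make the idea of superword-level parallelism (SLP) concrete, here is a minimal sketch in portable C. It only illustrates the transformation: a scalar loop, and an equivalent loop that handles eight 32-bit elements per step, modeling one operation on a 256-bit wide word. The names `superword_t`, `sw_add`, `add_scalar` and `add_superword` are hypothetical, the eight-lane 32-bit layout is an assumption, and none of this is the actual DIVA instruction set or the compiler's real output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: one 256-bit superword viewed as eight 32-bit lanes. */
#define LANES 8
typedef struct { uint32_t lane[LANES]; } superword_t;

/* Lane-wise add; stands in for a single wide-word instruction. */
static superword_t sw_add(superword_t x, superword_t y) {
    superword_t r;
    for (int k = 0; k < LANES; k++)
        r.lane[k] = x.lane[k] + y.lane[k];
    return r;
}

/* Scalar reference: one 32-bit add per iteration. */
void add_scalar(uint32_t *c, const uint32_t *a, const uint32_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Superword version: LANES adjacent, isomorphic adds are packed into
 * one sw_add per iteration. Requires n to be a multiple of LANES. */
void add_superword(uint32_t *c, const uint32_t *a, const uint32_t *b, size_t n) {
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {
        superword_t x, y, r;
        memcpy(x.lane, a + i, sizeof x.lane);  /* wide load  */
        memcpy(y.lane, b + i, sizeof y.lane);  /* wide load  */
        r = sw_add(x, y);                      /* wide add   */
        memcpy(c + i, r.lane, sizeof r.lane);  /* wide store */
    }
}
```

Both functions produce the same array; the second simply issues one eighth as many (modeled) arithmetic and memory operations.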

Overview of Implementation

* Insertion of custom PIMs in the memory space of a commodity platform
  - Itanium-2 HP zx6000
  - DDRAM interface

* System software
  - Linux 2.4 & 2.6 device driver for PIM
  - compiler technology

* Challenges related to reliability and bandwidth features of commodity memory systems
  - ECC
  - ChipSpare (ChipKill)
  - address line interleaving

Measurements

Comparison: Itanium-2 vs. PIM

             Clock     Transistors          Power    CPU Info                            Area
Itanium-2    900 MHz   221M                 ~100 W   EPIC, 6-way                         421 mm²
1 PIM        140 MHz   56.6M (55M memory)   ~1 W     single-issue, in-order, pipelined   121 mm²
8 PIMs       140 MHz   453M (440M memory)   ~8 W     single-issue, in-order, pipelined   8 x 121 mm²

[Figure: StreamAdd Performance. x-axis: data set size in # elts; y-axis: exec. time in sec.]

[Figure: StreamAdd Data Layout Sensitivity. x-axis: offset between arrays in # elts; y-axis: exec. time in sec.]
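The poster does not list the StreamAdd kernel itself. As a rough sketch, assume it is the element-wise sum `c[i] = a[i] + b[i]` familiar from stream-style benchmarks, and that "offset between arrays" means extra padding elements between consecutive arrays in memory. Under those assumptions the measured kernel and the layout knob might look like the code below. The names `stream_add` and `run_stream_add`, the `double` element type and the single-allocation layout are all hypothetical, and timing code is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed kernel: element-wise add of two input streams. */
void stream_add(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Carve three n-element arrays out of one allocation, leaving `offset`
 * padding elements between consecutive arrays, then run the kernel.
 * Returns the sum of the output array (a checksum), or -1.0 if the
 * allocation fails. */
double run_stream_add(size_t n, size_t offset) {
    assert(n > 0);
    size_t stride = n + offset;               /* distance between array starts */
    double *pool = malloc((2 * stride + n) * sizeof *pool);
    if (pool == NULL)
        return -1.0;

    double *a = pool;
    double *b = pool + stride;
    double *c = pool + 2 * stride;
    for (size_t i = 0; i < n; i++) {
        a[i] = (double)i;
        b[i] = 2.0 * (double)i;
    }

    stream_add(c, a, b, n);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += c[i];
    free(pool);
    return sum;
}
```

Changing `offset` moves the arrays relative to each other without changing the result, which is the kind of layout variation a sensitivity experiment would sweep.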

This material is based on research sponsored by AFRL and NSA under agreement number FA8750-04-1-0265. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL and NSA or the U.S. Government.

Current Work

PIMs for Knowledge Discovery
in collaboration with Hans Chalupsky & Jafar Adibi, USC ISI

Objective
• Evaluate link discovery (LD) algorithms on Godiva H/W.

Hypothesis
• LD algorithms are data-intensive and highly parallel
• Largely read-only data
• Irregular memory accesses; poor cache performance
• PIM technology would yield performance improvement

Expected Results
• Parallel PIM implementations of LD computations
• Performance comparisons with Itanium-2 host
• Analysis of software/hardware scalability requirements
• Analysis of programming complexity

Target LD Algorithms
• Properties
  – Sparse graph algorithms
  – Read-only data
  – Some temporal/spatial locality after partitioning
• Examples
  – Mutual information
  – Graph clustering
  – Model-based and mixture model methods such as HMMs

LD Challenges & Solution Approach
• Data
  – Incomplete data
  – Noise, corruption, uncertainty
  – Unaligned entities or groups
• Scale
  – Very large data volume
  – Connectivity curse
• Representation
  – Relational knowledge
  – Rich link vocabulary (Natural Language extraction)
  – Complex and dynamic domain knowledge
  – Meta-knowledge ("interestingness", etc.)
  – Multiple hypotheses
  – Threat/Non-Threat Patterns
• Solution approaches
  – Statistical methods: rarity analysis, mutual information, HHMM
  – RDBMS, parallelism, focused search
  – KR&R + partial logical inference
  – KR&R: PowerLoom logic-based KR&R system

Random Access

Panels: Original Code; Host and PIM Code; Host and PIM Algorithm Overview

[Figure: Random Access: Host Alone & Host w/ PIMs]

// Host-only RandomAccess

uInt64 Table[TABLESIZE];
uInt64 ran;

// initialize main table
for (i=0; i<TableSize; i++)
    Table[i] = i;

// perform updates
ran = 1;
for (i=0; i<NUpdates; i++) {
    // ran is unsigned, so test its top bit (bit 63) instead of comparing with 0
    ran = (ran << 1) ^ ((ran >> 63) ? POLY : 0);
    Table[ran & (TableSize-1)] ^= ran;
}

// Host code for Host-and-PIM RandomAccess

// initialize main table
for (i=0; i<TableSize; i++)
    Table[i] = i;

ran = 1;
for (i=0; i<NUpdates; i++) {
    // ran is unsigned, so test its top bit (bit 63) instead of comparing with 0
    ran = (ran << 1) ^ ((ran >> 63) ? POLY : 0);
    offset = ran & (TableSize-1);
    parcel.command = UPDATE;
    parcel.payload[0] = ran;
    parcel.payload[1] = offset;
    SendParcel(&Table[offset], parcel);
}

// tell every PIM that no more updates are coming
parcel.command = DONE;
for (pim=0; pim<NumPims; pim++)
    SendParcel(&done[pim], parcel);

// PIM code for Host-and-PIM RandomAccess

done = FALSE;
while (!done) {
    // check for parcels from host processor (non blocking)
    RecvParcel(hostParcelBuffer, recvStatus, parcel);
    if (recvStatus == TRUE) {
        if (parcel.command == UPDATE) {
            ran = parcel.payload[0];
            offset = parcel.payload[1];
            Table[offset] ^= ran;   // local memory access
        } else if (parcel.command == DONE) {
            done = TRUE;
        }
    }
}

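[Figure: update flow from the host through buckets to the PIMs]

Host
• Generate next update pair <ran, offset>
• Bucket update for PIM according to address
• If bucket full, send parcel

Buckets: b0, b1, ..., bn-1

PIM0, PIM1, ..., PIMn-1 (each with a parcel buffer and memory)
• Recv parcel
• Perform updates (e.g. ran3, ran1, ran0, ran2)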
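The bucketing scheme in the overview can be sketched as a single-process simulation. This is a minimal sketch, not the authors' implementation: the PIMs are modeled as slices of one table, "sending a parcel" is modeled by applying a bucket's updates directly, and the constants `NUM_PIMS`, `TABLE_SIZE`, `BUCKET_CAP` and `POLY`, the block distribution of the table across PIMs, and all function names are assumptions made for illustration. Because each update is an XOR, the order in which updates reach the table does not matter, so the bucketed version must end with the same table as the host-only loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_PIMS   4        /* hypothetical number of PIMs           */
#define TABLE_SIZE 1024     /* hypothetical; must be a power of two  */
#define BUCKET_CAP 8        /* hypothetical updates per parcel       */
#define POLY       0x7ULL   /* assumed feedback polynomial           */

typedef struct { uint64_t ran, offset; } update_t;
typedef struct { update_t item[BUCKET_CAP]; int count; } bucket_t;

/* Same recurrence as the host loop, testing the top bit of ran. */
static uint64_t next_ran(uint64_t ran) {
    return (ran << 1) ^ ((ran >> 63) ? POLY : 0);
}

/* Reference: host performs every update itself. */
void host_only(uint64_t *table, uint64_t nupdates) {
    for (uint64_t i = 0; i < TABLE_SIZE; i++) table[i] = i;
    uint64_t ran = 1;
    for (uint64_t i = 0; i < nupdates; i++) {
        ran = next_ran(ran);
        table[ran & (TABLE_SIZE - 1)] ^= ran;
    }
}

/* Stand-in for "send parcel": the owning PIM applies the whole bucket
 * to its part of the table, then the bucket is reused. */
static void flush_bucket(uint64_t *table, bucket_t *b) {
    for (int k = 0; k < b->count; k++)
        table[b->item[k].offset] ^= b->item[k].ran;
    b->count = 0;
}

/* Host generates updates, buckets them per PIM, flushes full buckets. */
void host_with_pims(uint64_t *table, uint64_t nupdates) {
    assert((TABLE_SIZE & (TABLE_SIZE - 1)) == 0);
    bucket_t bucket[NUM_PIMS];
    memset(bucket, 0, sizeof bucket);
    for (uint64_t i = 0; i < TABLE_SIZE; i++) table[i] = i;

    uint64_t ran = 1;
    for (uint64_t i = 0; i < nupdates; i++) {
        ran = next_ran(ran);
        uint64_t offset = ran & (TABLE_SIZE - 1);
        /* Assumed block distribution: PIM p owns one contiguous slice. */
        int pim = (int)(offset / (TABLE_SIZE / NUM_PIMS));
        bucket_t *b = &bucket[pim];
        b->item[b->count].ran = ran;
        b->item[b->count].offset = offset;
        b->count++;
        if (b->count == BUCKET_CAP)
            flush_bucket(table, b);
    }
    /* Drain partially filled buckets before signalling completion. */
    for (int pim = 0; pim < NUM_PIMS; pim++)
        flush_bucket(table, &bucket[pim]);
}
```

Batching several updates per parcel trades a little host-side buffering for fewer, larger transfers to each PIM; the final drain step plays the role of the DONE handshake in the poster's code.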
