Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit
Programmable Near-Memory Acceleration on ConTutto
Jan van Lunteren, IBM Research
Team
IBM Zurich (CH)
Jan van Lunteren, Christoph Hagleitner
IBM Dwingeloo (NL)
Leandro Fiorin, Erik Vermij
IBM Boeblingen (DE)
Angelo Haller, Jörg-Stephan Vogt, Harald Huels
IBM Burlington, Poughkeepsie, Rochester, Yorktown (US)
Thomas Roewer, Bharat Sukhwani, Adam McPadden, Dean Sanner, Dave Cadigan, Sameh Asaad
4/1/2016
POWER8™ Memory System
[Figure: POWER8™ processor connected through eight DMI links to eight Centaur memory-buffer chips, each driving Memory DIMMs; ConTutto FPGA cards substitute for Centaur buffers on DMI links, enabling New Memory Technologies and Near-Memory Acceleration]
Trends
[Figure: HPC system-level power breakdown. Source: R. Nair, "Active Memory Cube," 2nd Workshop on Near-Data Processing, 2014]
[Figure: Chip-level energy trends, 2012 vs. 2018. Source: S. Borkar, "Exascale Computing - a fact or a fiction?," IPDPS, 2013]
Power consumption is increasingly dominated by data transfer and memory
Solutions
Specialization
Workload-optimized systems: holistic optimization of the HW/SW stack
General-purpose accelerators: GPUs, FPGAs, DSPs
Reduced programmability: fixed-function accelerators (ASICs)
orders-of-magnitude performance/power improvements for selected workloads
Near-memory computing
Bring computation closer to the data (e.g., card, package, chip, memory periphery/array)
Reduce power-expensive data transfers by moving from a compute-centric to a data-centric model
[Figures: near-memory computing in a 3D stack; data-centric computing]
Challenges
Can we combine workload optimization and near-memory computing?
Memory performance and power consumption depend on a complex interaction between workload and memory system:
• locality of reference, access patterns/strides, etc.
• cache size, associativity, replacement policy, etc.
• bank interleaving, refresh, row-buffer hits, etc.
The memory system typically is a "black box"
Memory-system operation is mostly fixed, providing no or very limited options for adaptation to the workload characteristics
‒ the opposite happens: "bare metal" programming to adapt the workload to the memory system
Can we make the memory system programmable/adaptive?
How can we integrate programmable compute capabilities to achieve substantial performance and power gains for a wide range of workloads?
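As a toy illustration of this workload/memory interaction (a sketch with invented parameters, not the POWER8 memory system), the following model counts row-buffer hits for two access patterns under a simple row-interleaved bank mapping:

```python
def row_buffer_hits(addresses, num_banks=8, row_bytes=4096):
    """Count accesses that hit the currently open row of their bank
    (one open-row register per bank, open-page policy)."""
    open_row = {}
    hits = 0
    for addr in addresses:
        row = addr // row_bytes
        bank = row % num_banks        # simple row-interleaved bank mapping
        if open_row.get(bank) == row:
            hits += 1
        open_row[bank] = row
    return hits

seq = list(range(0, 1 << 20, 64))            # sequential 64-byte lines
strided = list(range(0, 1 << 20, 8 * 4096))  # stride lands in same bank, new row
print(row_buffer_hits(seq))      # 16128 of 16384 accesses hit the open row
print(row_buffer_hits(strided))  # 0 hits: every access opens a new row
```

The point of the sketch: the same memory hardware serves one pattern almost entirely from open rows and the other not at all, which is why a fixed, opaque memory system leaves performance and power on the table.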
Programmable Near-Memory Acceleration
Conventional computer architecture
Memory system is a "slave" of the host processor
Novel approach
Memory system actively participates to ensure that data is stored, accessed and transferred in the most (power-)efficient way, resulting in the highest performance/Watt
Memory system integrates compute capabilities
Memory Controller → Access Processor
Novel programmable architecture
Enabling/differentiating technologies:
• programmable state machine technology
• programmable address mapping scheme
• power-efficient "self-running" instructions
Near-memory accelerators attach to the Access Processor
[Figure: host processor (cores, register files, L1/L2 caches, shared L3 cache, memory controllers) connected to main memory (DRAM banks with row buffers); in the novel approach the memory controller becomes an Access Processor with attached Near-Memory Accelerators]
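The programmable address mapping listed above can be pictured with a minimal sketch (the bit positions and function names are invented for illustration; the actual scheme is not described here): the address bits that select bank, row and column are themselves configuration data, so an AP program can pick the permutation that fits a workload's stride.

```python
def make_mapping(bank_bits, row_bits, col_bits):
    """Build an address decoder from three lists of physical-address
    bit positions; the lists are the 'program' of the mapping."""
    def extract(addr, bits):
        # gather the selected address bits into a compact field
        return sum(((addr >> b) & 1) << i for i, b in enumerate(bits))
    return lambda addr: (extract(addr, bank_bits),
                         extract(addr, row_bits),
                         extract(addr, col_bits))

# Two configurations of the same decoder "hardware":
low_interleave = make_mapping(bank_bits=[6, 7, 8],
                              row_bits=list(range(15, 30)),
                              col_bits=[0, 1, 2, 3, 4, 5] + list(range(9, 15)))
high_interleave = make_mapping(bank_bits=[15, 16, 17],
                               row_bits=list(range(18, 33)),
                               col_bits=list(range(0, 15)))

print(low_interleave(0x1C0))   # bits 6-8 set → bank 7 under this config
```

The same address decodes to different (bank, row, column) tuples under the two configurations, which is the adaptation knob a fixed controller lacks.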
Access Processor (AP)
Basic memory controller functions
Address mapping, access scheduling, refresh, page open/close, etc., are programmable
Memory details (bank organization, retention times, etc.) are exposed to the AP program
Near-Memory Accelerator (NMA) support
AP-NMA interface types
‒ L1: tightly coupled, AP generates addresses
‒ L2: loosely coupled, AP generates addresses
‒ L3: loosely coupled, NMA generates addresses
Arbitration of processor and NMA accesses
• fine-grained access bandwidth control
Interception/redirection/copy of processor accesses to enable on-the-fly processing, snooping/caching
• address translation tables (virtual/physical)
[Figure: system diagram as before, plus data-path variants: processor↔memory, processor↔NMA↔memory, and on-the-fly processing inserted into the processor-memory path]
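The fine-grained bandwidth control above could, for instance, work like the following sketch (an assumed slot-based scheme, not the actual AP arbiter): out of every `window` memory slots, up to `nma_share` are offered to the NMA first, with unused slots falling through to the other requester.

```python
from collections import deque

def arbitrate(cpu_reqs, nma_reqs, window=4, nma_share=1):
    """Interleave two request streams with a slot-based bandwidth cap."""
    cpu, nma = deque(cpu_reqs), deque(nma_reqs)
    schedule, slot = [], 0
    while cpu or nma:
        nma_turn = (slot % window) < nma_share   # NMA's reserved slots
        if nma_turn and nma:
            schedule.append(('nma', nma.popleft()))
        elif cpu:
            schedule.append(('cpu', cpu.popleft()))
        else:                                    # unused slots fall through
            schedule.append(('nma', nma.popleft()))
        slot += 1
    return schedule

print(arbitrate(['a', 'b', 'c'], ['x', 'y'], window=2, nma_share=1))
# → [('nma', 'x'), ('cpu', 'a'), ('nma', 'y'), ('cpu', 'b'), ('cpu', 'c')]
```

Because the ratio is a runtime parameter, the AP program can throttle accelerator traffic per workload instead of hard-wiring a fixed priority.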
Access Processor (AP)
Near-Memory Accelerator support (continued)
Applications executed on the host processor interact with the AP through special instructions (e.g., PowerEN icswx) and/or special data structures mapped onto the AP command port
AP can be dynamically (re)programmed at runtime
• binary loaded from cache or main memory
AP is multi-threaded and provides multi-session support
AP manages NMA configuration
• configures execution pipelines, loads parameters, constants, etc.
• dynamic reconfiguration of FPGA-based NMAs
‒ controls storage, access and transfer of configuration data from main memory to NMAs
Performance monitoring
Multiple APs interconnect to scale to larger systems
[Figure: system diagram as before, scaled out with additional Near-Memory Accelerators and Main Memory attached to interconnected Access Processors]
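The command-port interaction above can be sketched in a few lines (the descriptor layout and opcode values are invented for illustration; the real interface is not specified here): the host packs a command descriptor into a shared structure mapped onto the command port, and an AP thread decodes it and dispatches the matching routine.

```python
import struct

# Assumed descriptor layout: 32-bit opcode, 64-bit source address, 64-bit length
CMD_FMT = '<IQQ'

def encode_cmd(opcode, src, length):
    """Host side: pack a command descriptor for the AP command port."""
    return struct.pack(CMD_FMT, opcode, src, length)

def ap_dispatch(port_bytes, handlers):
    """AP side: decode the descriptor and invoke the matching handler."""
    opcode, src, length = struct.unpack(CMD_FMT, port_bytes)
    return handlers[opcode](src, length)

OP_FFT = 1  # invented opcode for a hypothetical FFT accelerator command
handlers = {OP_FFT: lambda src, length: ('fft', src, length)}
print(ap_dispatch(encode_cmd(OP_FFT, 0x1000, 256), handlers))
# → ('fft', 4096, 256)
```

A fixed binary layout like this is what lets an ordinary store (or an icswx-style instruction) carry a complete command without any driver round trip.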
Near-Memory Acceleration on ConTutto
ConTutto
Ideal platform to investigate and experiment with Near-Memory Acceleration on a commercial OpenPOWER server, addressing multiple aspects:
• design of near-memory accelerator devices
• integration into the computer system architecture
• use of multiple devices to scale to larger storage and processing capabilities
• programming of a hybrid system based on near-memory computing
• applications
Demonstration of an initial implementation of the Programmable Near-Memory Accelerator concept on ConTutto for FFT computation at the IBM booth
Ongoing work
• design space exploration covering device, system and application levels
• development of near-memory computing tool set and ecosystem including compiler, debugger, performance analysis, and run-time optimization tools
Concluding remarks
This work was initiated as part of the DOME project, in which IBM and the Netherlands Institute for Radio Astronomy (ASTRON) jointly perform fundamental research on large-scale green Exascale computing for the Square Kilometre Array (SKA), which will become the largest and most sensitive radio telescope in the world
Three PhD positions are available as part of the European Union Horizon 2020 / Marie Curie ITN-EID program NeMeCo, which is aimed at developing power-efficient HPC systems for big-data processing based on the exploitation of near-memory computing
• topics:
‒ run-time optimization
‒ compiler technologies
‒ near-memory accelerator architecture
more information at http://ec.europa.eu/euraxess/ keyword: NeMeCo
Backup Material
B-FSM Technology
Programmable state machine
Efficient multi-way branches involving evaluation of many (combinations of) conditions in parallel:
• loop conditions, counters, timers, data arrival, etc.
Compact data structure
Fast deterministic reaction time
• dispatch instructions within 2 cycles (@ > 2 GHz)
Multi-threaded operation
Successful application to a range of accelerators
Regular expression scanners, protocol engines, XML parsers, near-memory accelerators
Processing rates of ~20 Gbit/s per B-FSM in 45 nm
Small area cost enables scaling to extremely high aggregate processing rates
[Figure: B-FSM inside the Access Processor]
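The multi-way branching can be illustrated with a toy rule-driven state machine (a deliberately simplified software sketch; the actual B-FSM is a hardware structure and its transition-rule encoding is not described here): each (state, input) pair selects, in a single lookup, both the next state and the action to dispatch, with wildcard rules covering the remaining cases.

```python
# Transition rules: (state, input) -> (next_state, action); '*' is a wildcard
rules = {
    ('scan',   'digit'): ('number', 'emit_start'),
    ('number', 'digit'): ('number', 'emit_digit'),
    ('number', '*'):     ('scan',   'emit_end'),
    ('scan',   '*'):     ('scan',   'skip'),
}

def step(state, symbol):
    # one multi-way branch: exact rule first, then the wildcard rule
    return rules.get((state, symbol)) or rules[(state, '*')]

def run(symbols, state='scan'):
    """Drive the machine over an input stream, collecting dispatched actions."""
    actions = []
    for s in symbols:
        state, action = step(state, s)
        actions.append(action)
    return actions

print(run(['digit', 'digit', 'space']))
# → ['emit_start', 'emit_digit', 'emit_end']
```

Collapsing condition evaluation and action dispatch into one table lookup per input is what gives the deterministic two-cycle reaction time claimed above.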
Near-Memory Acceleration in 3D Stack