Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit
Programmable Near-Memory Acceleration on ConTutto
Jan van Lunteren, IBM Research
Team
IBM Zurich (CH)
Jan van Lunteren, Christoph Hagleitner
IBM Dwingeloo (NL)
Leandro Fiorin, Erik Vermij
IBM Boeblingen (DE)
Angelo Haller, Jörg-Stephan Vogt, Harald Huels
IBM Burlington, Poughkeepsie, Rochester, Yorktown (US)
Thomas Roewer, Bharat Sukhwani, Adam McPadden, Dean Sanner, Dave Cadigan, Sameh Asaad
4/1/2016
POWER8™ Memory System
[Figure: POWER8™ processor connected through eight DMI links to eight Centaur memory-buffer chips, each driving Memory DIMMs; ConTutto FPGA cards substitute for Centaur buffers on DMI links, enabling New Memory Technologies and Near-Memory Acceleration]
Trends
[Figure: HPC system-level power breakdown. Source: R. Nair, "Active Memory Cube," 2nd Workshop on Near-Data Processing, 2014]
[Figure: Chip-level energy trends, 2012 vs. 2018. Source: S. Borkar, "Exascale Computing - a fact or a fiction?," IPDPS, 2013]
Power consumption is increasingly dominated by data transfer and memory
Solutions
Specialization
Workload-optimized systems: holistic optimization of the HW/SW stack
General-purpose accelerators: GPUs, FPGAs, DSPs
Reduced programmability: fixed-function accelerators (ASICs)
orders-of-magnitude performance/power improvements for selected workloads
Near-memory computing
Bring computation closer to the data (e.g., card, package, chip, memory periphery/array)
Reduce power-expensive data transfers by moving from a compute-centric to a data-centric model
[Figures: near-memory computing in a 3D stack; data-centric computing]
Challenges
Can we combine workload optimization and near-memory computing?
Memory performance and power consumption depend on a complex interaction between workload and memory system:
• locality of reference, access patterns/strides, etc.
• cache size, associativity, replacement policy, etc.
• bank interleaving, refresh, row-buffer hits, etc.
The memory system typically is a "black box"
Memory-system operation is mostly fixed, providing no or very limited options for adaptation to the workload characteristics
‒ the opposite happens: "bare metal" programming to adapt the workload to the memory system
Can we make the memory system programmable/adaptive?
How can we integrate programmable compute capabilities to achieve substantial performance and power gains for a wide range of workloads?
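As a toy illustration of this workload/memory interaction (a sketch with invented parameters, not the POWER8 memory system), the following model counts row-buffer hits for two access patterns under a simple row-interleaved bank mapping:

```python
def row_buffer_hits(addresses, num_banks=8, row_bytes=4096):
    """Count accesses that hit the currently open row of their bank
    (one open-row register per bank, open-page policy)."""
    open_row = {}
    hits = 0
    for addr in addresses:
        row = addr // row_bytes
        bank = row % num_banks        # simple row-interleaved bank mapping
        if open_row.get(bank) == row:
            hits += 1
        open_row[bank] = row
    return hits

seq = list(range(0, 1 << 20, 64))            # sequential 64-byte lines
strided = list(range(0, 1 << 20, 8 * 4096))  # stride lands in same bank, new row
print(row_buffer_hits(seq))      # 16128 of 16384 accesses hit the open row
print(row_buffer_hits(strided))  # 0 hits: every access opens a new row
```

The point of the sketch: the same memory hardware serves one pattern almost entirely from open rows and the other not at all, which is why a fixed, opaque memory system leaves performance and power on the table.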
Programmable Near-Memory Acceleration
Conventional computer architecture
Memory system is a "slave" of the host processor
Novel approach
Memory system actively participates to ensure that data is stored, accessed and transferred in the most (power-)efficient way, resulting in the highest performance/Watt
Memory system integrates compute capabilities
Memory Controller → Access Processor
Novel programmable architecture
Enabling/differentiating technologies:
• programmable state machine technology
• programmable address mapping scheme
• power-efficient "self-running" instructions
Near-memory accelerators attach to the Access Processor
[Figure: host processor (cores, register files, L1/L2 caches, shared L3 cache, memory controllers) connected to main memory (DRAM banks with row buffers); in the novel approach the memory controller becomes an Access Processor with attached Near-Memory Accelerators]
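The programmable address mapping listed above can be pictured with a minimal sketch (the bit positions and function names are invented for illustration; the actual scheme is not described here): the address bits that select bank, row and column are themselves configuration data, so an AP program can pick the permutation that fits a workload's stride.

```python
def make_mapping(bank_bits, row_bits, col_bits):
    """Build an address decoder from three lists of physical-address
    bit positions; the lists are the 'program' of the mapping."""
    def extract(addr, bits):
        # gather the selected address bits into a compact field
        return sum(((addr >> b) & 1) << i for i, b in enumerate(bits))
    return lambda addr: (extract(addr, bank_bits),
                         extract(addr, row_bits),
                         extract(addr, col_bits))

# Two configurations of the same decoder "hardware":
low_interleave = make_mapping(bank_bits=[6, 7, 8],
                              row_bits=list(range(15, 30)),
                              col_bits=[0, 1, 2, 3, 4, 5] + list(range(9, 15)))
high_interleave = make_mapping(bank_bits=[15, 16, 17],
                               row_bits=list(range(18, 33)),
                               col_bits=list(range(0, 15)))

print(low_interleave(0x1C0))   # bits 6-8 set → bank 7 under this config
```

The same address decodes to different (bank, row, column) tuples under the two configurations, which is the adaptation knob a fixed controller lacks.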
Access Processor (AP)
Basic memory controller functions
Address mapping, access scheduling, refresh, page open/close, etc., are programmable
Memory details (bank organization, retention times, etc.) are exposed to the AP program
Near-Memory Accelerator (NMA) support
AP-NMA interface types
‒ L1: tightly coupled, AP generates addresses
‒ L2: loosely coupled, AP generates addresses
‒ L3: loosely coupled, NMA generates addresses
Arbitration of processor and NMA accesses
• fine-grained access bandwidth control
Interception/redirection/copy of processor accesses to enable on-the-fly processing, snooping/caching
• address translation tables (virtual/physical)
[Figure: system diagram as before, plus data-path variants: processor↔memory, processor↔NMA↔memory, and on-the-fly processing inserted into the processor-memory path]
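The fine-grained bandwidth control above could, for instance, work like the following sketch (an assumed slot-based scheme, not the actual AP arbiter): out of every `window` memory slots, up to `nma_share` are offered to the NMA first, with unused slots falling through to the other requester.

```python
from collections import deque

def arbitrate(cpu_reqs, nma_reqs, window=4, nma_share=1):
    """Interleave two request streams with a slot-based bandwidth cap."""
    cpu, nma = deque(cpu_reqs), deque(nma_reqs)
    schedule, slot = [], 0
    while cpu or nma:
        nma_turn = (slot % window) < nma_share   # NMA's reserved slots
        if nma_turn and nma:
            schedule.append(('nma', nma.popleft()))
        elif cpu:
            schedule.append(('cpu', cpu.popleft()))
        else:                                    # unused slots fall through
            schedule.append(('nma', nma.popleft()))
        slot += 1
    return schedule

print(arbitrate(['a', 'b', 'c'], ['x', 'y'], window=2, nma_share=1))
# → [('nma', 'x'), ('cpu', 'a'), ('nma', 'y'), ('cpu', 'b'), ('cpu', 'c')]
```

Because the ratio is a runtime parameter, the AP program can throttle accelerator traffic per workload instead of hard-wiring a fixed priority.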
Access Processor (AP)
Near-Memory Accelerator support (continued)
Applications executed on the host processor interact with the AP through special instructions (e.g., PowerEN icswx) and/or special data structures mapped onto the AP command port
AP can be dynamically (re)programmed at runtime
• binary loaded from cache or main memory
AP is multi-threaded and provides multi-session support
AP manages NMA configuration
• configures execution pipelines, loads parameters, constants, etc.
• dynamic reconfiguration of FPGA-based NMAs
‒ controls storage, access and transfer of configuration data from main memory to NMAs
Performance monitoring
Multiple APs interconnect to scale to larger systems
[Figure: system diagram as before, scaled out with additional Near-Memory Accelerators and Main Memory attached to interconnected Access Processors]
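The command-port interaction above can be sketched in a few lines (the descriptor layout and opcode values are invented for illustration; the real interface is not specified here): the host packs a command descriptor into a shared structure mapped onto the command port, and an AP thread decodes it and dispatches the matching routine.

```python
import struct

# Assumed descriptor layout: 32-bit opcode, 64-bit source address, 64-bit length
CMD_FMT = '<IQQ'

def encode_cmd(opcode, src, length):
    """Host side: pack a command descriptor for the AP command port."""
    return struct.pack(CMD_FMT, opcode, src, length)

def ap_dispatch(port_bytes, handlers):
    """AP side: decode the descriptor and invoke the matching handler."""
    opcode, src, length = struct.unpack(CMD_FMT, port_bytes)
    return handlers[opcode](src, length)

OP_FFT = 1  # invented opcode for a hypothetical FFT accelerator command
handlers = {OP_FFT: lambda src, length: ('fft', src, length)}
print(ap_dispatch(encode_cmd(OP_FFT, 0x1000, 256), handlers))
# → ('fft', 4096, 256)
```

A fixed binary layout like this is what lets an ordinary store (or an icswx-style instruction) carry a complete command without any driver round trip.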
Near-Memory Acceleration on ConTutto
ConTutto
Ideal platform to investigate and experiment with Near-Memory Acceleration on a commercial OpenPOWER server, addressing multiple aspects:
• design of near-memory accelerator devices
• integration into the computer system architecture
• use of multiple devices to scale to larger storage and processing capabilities
• programming of a hybrid system based on near-memory computing
• applications
Demonstration of an initial implementation of the Programmable Near-Memory Accelerator concept on ConTutto for FFT computation at the IBM booth
Ongoing work
• design space exploration covering device, system and application levels
• development of near-memory computing tool set and ecosystem including compiler, debugger, performance analysis, and run-time optimization tools
Concluding remarks
This work was initiated as part of the DOME project, in which IBM and the Netherlands Institute for Radio Astronomy (ASTRON) jointly perform fundamental research on large-scale green Exascale computing for the Square Kilometre Array (SKA), which will become the largest and most sensitive radio telescope in the world
Three PhD positions are available as part of the European Union Horizon 2020 / Marie Curie ITN-EID program NeMeCo, which is aimed at developing power-efficient HPC systems for big-data processing based on the exploitation of near-memory computing
• topics:
‒ run-time optimization
‒ compiler technologies
‒ near-memory accelerator architecture
more information at http://ec.europa.eu/euraxess/ keyword: NeMeCo
Backup Material
B-FSM Technology
Programmable state machine
Efficient multi-way branches involving evaluation of many (combinations of) conditions in parallel:
• loop conditions, counters, timers, data arrival, etc.
Compact data structure
Fast deterministic reaction time
• dispatch instructions within 2 cycles (@ > 2 GHz)
Multi-threaded operation
Successful application to a range of accelerators
Regular expression scanners, protocol engines, XML parsers, near-memory accelerators
Processing rates of ~20 Gbit/s per B-FSM in 45 nm
Small area cost enables scaling to extremely high aggregate processing rates
[Figure: B-FSM inside the Access Processor]
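The multi-way branching can be illustrated with a toy rule-driven state machine (a deliberately simplified software sketch; the actual B-FSM is a hardware structure and its transition-rule encoding is not described here): each (state, input) pair selects, in a single lookup, both the next state and the action to dispatch, with wildcard rules covering the remaining cases.

```python
# Transition rules: (state, input) -> (next_state, action); '*' is a wildcard
rules = {
    ('scan',   'digit'): ('number', 'emit_start'),
    ('number', 'digit'): ('number', 'emit_digit'),
    ('number', '*'):     ('scan',   'emit_end'),
    ('scan',   '*'):     ('scan',   'skip'),
}

def step(state, symbol):
    # one multi-way branch: exact rule first, then the wildcard rule
    return rules.get((state, symbol)) or rules[(state, '*')]

def run(symbols, state='scan'):
    """Drive the machine over an input stream, collecting dispatched actions."""
    actions = []
    for s in symbols:
        state, action = step(state, s)
        actions.append(action)
    return actions

print(run(['digit', 'digit', 'space']))
# → ['emit_start', 'emit_digit', 'emit_end']
```

Collapsing condition evaluation and action dispatch into one table lookup per input is what gives the deterministic two-cycle reaction time claimed above.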
Near-Memory Acceleration in 3D Stack