Active Memory Cube
Ravi Nair
IBM T. J. Watson Research Center
December 14, 2014
The Top 500 Supercomputers
The Power Story
System        Linpack (TFlop/s)   Power (kW)   MW per Exaflop/s
Tianhe-2            33,862          17,808           526
Titan               17,590           8,209           467
BG/Q                17,173           7,890           459
K-Computer          10,510          12,660          1205
1 MW of power costs roughly $1M per year
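As a worked example at that rate, a 20 MW system consumes roughly 20 x $1M x 5 = $100M of electricity over five years of operation; the figure below compares such running costs with the acquisition cost.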
The Cost of Power
[Figure: cost of power in M$ (0 to 600) versus years of operation (0 to 5), for 20 MW, 50 MW, and 100 MW systems, compared with the acquisition cost]
Power Scaling
[Figure: power breakdown at 20 PF/s (2012) versus a projected 1 EF/s (2018); components shown are link, network logic, DDR chip, DDR I/O, L2 cache, L2-to-DDR bus, L1P-to-L2 bus, L1P, L1D, leakage, clock, integer core, and floating-point power]
At 1 EF/s (2018), memory power and I/O to memory become major contributors.
Active Memory
Active Memory Cube
Addition to HMC
Modified base logic layer of HMC
DRAM layers of HMC unmodified
Example Exascale Node
Figure from [IBM JRD]
Processing Lane
Figure from [IBM JRD]
Vector, Scalar and Mask Registers
Processing Lane
Power-Efficient Microarchitecture
Direct path from memory to registers
Short latency to memory
Low pJ/bit TSV interconnection between memory and processing lanes
No dynamic scheduling of instructions (pipeline exposed to program)
Explicit parallelism
Efficient vector register file
Chained operations to avoid intermediate writes to register file
Largely predecoded commands to various execution units, similar to horizontal microcoding
No instruction cache
Aggressive gating of unused paths
Instruction Set
Sample AMC Lane Instruction
Each lane instruction consists of a repeat count (Rep), a lane-control slot, and an {AL; LS} operation pair for each of the four slices (AL: arithmetic, LS: load/store):

[32] nop {f1mul vr3,vr2,vr1; nop} {x2add vr5,vr7,vr9; nop} {nop; nop} {nop; nop}

– [32] nop: perform the operation on the next 32 elements; do not branch at the end of the instruction, just continue to the next instruction
– {f1mul vr3,vr2,vr1; nop} (Slice 0): DP multiply of an element from vector register 2 with an element from vector register 1; result in vector register 3; register pointers are bumped to the next element in each vector
– {x2add vr5,vr7,vr9; nop} (Slice 1): 32-bit integer add of 2 elements from vector register 7 to 2 elements from vector register 9; results in vector register 5; register pointers are bumped to the next elements in each vector

The same operations can be applied to registers in the scalar register file; in that case the compiler cannot place a dependent operation in the next instruction:

[1] nop {f1mul sr3,sr2,sr1; nop} {x2add sr5,sr7,sr9; nop} {nop; nop} {nop; nop}
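As a rough illustration, the C loop below sketches what the vector instruction above computes; the array views of the registers, their lengths, and the element ordering are assumptions for illustration, not AMC definitions.

    #include <stdint.h>

    void lane_instruction_semantics(double vr1[32], double vr2[32], double vr3[32],
                                    int32_t vr7[64], int32_t vr9[64], int32_t vr5[64])
    {
        for (int rep = 0; rep < 32; rep++) {          /* Rep field: [32] */
            /* Slice 0, f1mul vr3,vr2,vr1: one DP multiply per repetition,
               with register pointers bumped to the next element */
            vr3[rep] = vr2[rep] * vr1[rep];
            /* Slice 1, x2add vr5,vr7,vr9: two 32-bit integer adds per repetition */
            vr5[2*rep]     = vr7[2*rep]     + vr9[2*rep];
            vr5[2*rep + 1] = vr7[2*rep + 1] + vr9[2*rep + 1];
        }
    }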
LSU Commands, including strides, gather, scatter
The LS slot of each slice issues load/store unit commands:

[32] nop {nop; l8u vr3,sr2,sr1} {nop; l8 vr5,vr7,sr9} {nop; nop} {nop; nop}

– [32] nop: perform the operation on the next 32 elements
– {nop; l8u vr3,sr2,sr1} (Slice 0), a strided load: load a DP element into vector register 3 from the address obtained by adding scalar register 2 to scalar register 1; the next load will use the updated value of sr1
– {nop; l8 vr5,vr7,sr9} (Slice 1), a gather: load a DP element into vector register 5 from the address given by an element of vector register 7 incremented by the scalar value in sr9; the vector register pointer is bumped to the element that provides the next address
Bandwidth and Latency (Sequential Access)
Figure from [IBM JRD]
Improvement with Open-Page Mode
Figure from [IBM JRD]
Coherence
No caches within AMC
Granularity of transfer to lane – 8 bytes to 256 bytes
Coherence bit checked on access
– Overhead only on small-granularity stores
DRAM page: 1024 bytes + 1024 bits of metadata
Line: 128 bytes + 128 bits of metadata
Vault block: 32 bytes + 32 bits of metadata (2 bits coherence, 30 bits ECC + other)
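A hypothetical C view of the 32 metadata bits accompanying each 32-byte vault block; only the 2/30 split comes from the slide, while the field names and bit ordering are illustrative.

    #include <stdint.h>

    struct vault_block_metadata {
        uint32_t coherence : 2;    /* coherence bits checked on access */
        uint32_t ecc_other : 30;   /* ECC + other bits */
    };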
AMC API
Goal:
– Expose all hardware resources of interest to the user
– Allow efficient sharing of AMC execution units
– Allow efficient user-mode access
Logical naming of resources; physical mapping by the OS
– The OS is a minimal kernel, the Compute Node Kernel (CNK) employed on BG/Q
Memory placement: allocation and relocation of memory
AMC Interface Segment – for lane code and initialized data
Data areas (with addresses in special-purpose registers)
– Common Area: for data common to all lanes
– Lane Stack: for data specific to a single lane
Lane Completion Mechanism (sketched below)
– Atomic fetch-and-decrement
– Wakeup of the host
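A minimal sketch of the completion idea, assuming each lane atomically decrements a shared counter when it finishes and the last one notifies the host; lane_done and wake_host are hypothetical names, not AMC API calls.

    #include <stdatomic.h>

    extern void wake_host(void);     /* hypothetical host-notification hook */

    atomic_int lanes_remaining;      /* initialized to the number of active lanes */

    void lane_done(void)
    {
        /* atomic fetch-and-decrement; the lane that takes the count
           from 1 to 0 is the last to finish and wakes the host */
        if (atomic_fetch_sub(&lanes_remaining, 1) == 1)
            wake_host();
    }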
AMC Simulator
Supports all features of the Instruction Set Architecture
Timing-correct at the AMC level, including internal interconnect and vaults
Functionally correct at the host level, including privileged mode
OS and runtime accurately simulated
Provides support for power and reliability calculation
Provides support for experimenting with different mixes of resources and with new features
Compiling to the AMC
[Figure: compilation flow; a user-annotated program (OpenMP 4.0) goes through the XL Compiler and LLVM to a binary, and user-written vector ISA code goes through the Assembler to a binary]
AMC Compiler
Supports C, C++, and Fortran front ends
User annotates the program using OpenMP 4.0 accelerator directives (a sketch follows this list)
– Identify code section to run on AMC
– Identify data region accessed by AMC
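A minimal sketch of this annotation style using standard OpenMP 4.0 target directives; the DAXPY kernel here is illustrative, not taken from the slides.

    void daxpy(int n, double a, double *x, double *y)
    {
        /* "target" identifies the code section to run on the accelerator;
           "map" identifies the data regions it reads and writes */
        #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }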
Compiled Applications
DGEMM
DAXPY
Determinant
LULESH
SNAP
Nekbone
UMT
CoMD
Challenges/Opportunities for Compiler Exploitation
Heterogeneity
Programmable length vectors
Gather-scatter
Slice-mapping
Predication using mask registers
Scalar-vector combination code
Memory latency hiding
Instruction Buffer size
Techniques Used
Polyhedral framework
Loop nests analyzed for vector exploitation
– Loop blocking, loop versioning, and loop unrolling applied in an integrated manner (see the blocking sketch after this list)
Software I-cache
Limited source code transformation
– LULESH Kernel 1: distribute the kernel into two loops to help scheduling
– Nekbone: manually inline multiply-nested procedure calls
– These transformations will be automated eventually
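A generic sketch of loop blocking with a versioned residual loop, assuming a hardware vector length of 32 as in the earlier instruction examples; this is illustrative, not the XL compiler's actual output.

    /* block a loop into chunks of the vector length; full blocks can map to
       [32]-repeat vector instructions, the residue to a versioned scalar loop */
    void scaled_copy(int n, const double *a, double *b)
    {
        enum { VL = 32 };
        int i = 0;
        for (; i + VL <= n; i += VL)       /* blocked, vectorizable version */
            for (int j = 0; j < VL; j++)
                b[i + j] = 2.0 * a[i + j];
        for (; i < n; i++)                 /* residual version */
            b[i] = 2.0 * a[i];
    }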
Performance
DGEMM – C = C – A x B (reference loop sketched below)
– 83% of peak performance: 266 GF per AMC, hand assembled
  77% through the compiler
– 10 W power in 14 nm technology
– Roughly 20 GF/W at the system level
– 7 nm projection: 56 GF/W at the AMC
  Roughly at target at the system level
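For reference, a naive C sketch of the operation measured above (C = C – A x B), assuming row-major square matrices; this shows only the semantics, not the hand-scheduled AMC kernel.

    void dgemm_ref(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] -= A[i*n + k] * B[k*n + j];
    }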
Breakdown of Resources
[Figure: area breakdown and power breakdown of the AMC, from [IBM JRD]]
Bandwidth
DAXPY (Y = Y + a x X)
Performance depends on data striping
4.95 computations per cycle per AMC when striped across all lanes
– Out of 32 possible
– Low because of bandwidth limitation
Improves to 7.66 computations per cycle when data is blocked within a vault
Bandwidth utilized
– 153.2 GB/s (read), 76.6 GB/s (write) per AMC
– 2.4 TB/s (read), 1.2 TB/s (write) across the node
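This is consistent with the arithmetic of the kernel: each DAXPY element reads X and Y (16 bytes) and writes Y (8 bytes) for a single multiply-add, which matches the 2:1 read-to-write ratio above and leaves the lanes waiting on memory rather than on arithmetic.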
Power Distribution
Power Consumption Across Applications
Real applications tend to be less power-constrained
[Figure: AMC power in 14 nm technology (32 lanes), 0 to 12 W per kernel, broken down into links, clock grid, leakage, lanes, network, and DRAM, for the kernels dgemm_base, dgemm_atcomb, Nekbone_IP, Nekbone_DAXPY, Nekbone_AX_e, Lulesh1_VOL, COMD_LJ, and COMD_EAM]
Concluding Remarks
3D technology is reviving interest in Processing-in-Memory
The AMC design demonstrates a way to meet DoE Exascale requirements of 1 Exaflops in 20 MW
– An AMC-like PIM approach may be the only way to achieve this target in 7 nm CMOS technology
Widespread use of the AMC technology will depend on commodity adoption of such 3D memory+logic stacks
Thank you!
This work was supported in part by Department of Energy Contract B599858 under the FastForward initiative
– PIs: Ravi Nair and Jaime Moreno
Collaborators
– S. F. Antao, C. Bertolli, P. Bose, J. Brunheroto, T. Chen, C.-Y. Cher, C. H. A. Costa, J. Doi, C. Evangelinos, B. M. Fleischer, T. W. Fox, D. S. Gallo, L. Grinberg, J. A. Gunnels, A. C. Jacob, P. Jacob, H. M. Jacobson, T. Karkhanis, C. Kim, J. K. O'Brien, M. Ohmacht, Y. Park, D. A. Prener, B. S. Rosenburg, K. D. Ryu, O. Sallenave, M. J. Serrano, P. Siegl, K. Sugavanam, Z. Sura
Upcoming Paper
[IBM JRD] R. Nair et al., "Active Memory Cube: A Processing-in-Memory Architecture for Exascale Systems," IBM Journal of Research and Development, vol. 59, no. 2/3, 2015.
IBM, BG/Q, Blue Gene/Q and Active Memory Cube are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide.