Active Memory Cube
Ravi Nair
IBM T. J. Watson Research Center
December 14, 2014
The Top 500 Supercomputers
The Power Story
System        Linpack (TFlop/s)   Power (kW)   MW per Exaflop/s
Tianhe-2            33,862          17,808           526
Titan               17,590           8,209           467
BG/Q                17,173           7,890           459
K-Computer          10,510          12,660          1205
1 MW of power costs roughly $1M per year
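As a worked example at that rate, a 20 MW system consumes roughly 20 x $1M x 5 = $100M of electricity over five years of operation; the figure below compares such running costs with the acquisition cost.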
The Cost of Power
[Figure: cost of power in M$ (0 to 600) versus years of operation (0 to 5), for 20 MW, 50 MW, and 100 MW systems, compared with the acquisition cost]
Power Scaling
[Figure: power breakdown at 20 PF/s (2012) versus a projected 1 EF/s (2018); components shown are link, network logic, DDR chip, DDR I/O, L2 cache, L2-to-DDR bus, L1P-to-L2 bus, L1P, L1D, leakage, clock, integer core, and floating-point power]
At 1 EF/s (2018), memory power and I/O to memory become major contributors.
Active Memory
Active Memory Cube
Addition to HMC
Modified base logic layer of HMC
DRAM layers of HMC unmodified
Example Exascale Node
Figure from [IBM JRD]
Processing Lane
Figure from [IBM JRD]
Vector, Scalar and Mask Registers
Processing Lane
Power-Efficient Microarchitecture
Direct path from memory to registers
Short latency to memory
Low pJ/bit TSV interconnection between memory and processing lanes
No dynamic scheduling of instructions (pipeline exposed to program)
Explicit parallelism
Efficient vector register file
Chained operations to avoid intermediate writes to register file
Largely predecoded commands to various execution units, similar to horizontal microcoding
No instruction cache
Aggressive gating of unused paths
Instruction Set
Sample AMC Lane Instruction
Each lane instruction consists of a repeat count (Rep), a lane-control slot, and an {AL; LS} operation pair for each of the four slices (AL: arithmetic, LS: load/store):

[32] nop {f1mul vr3,vr2,vr1; nop} {x2add vr5,vr7,vr9; nop} {nop; nop} {nop; nop}

– [32] nop: perform the operation on the next 32 elements; do not branch at the end of the instruction, just continue to the next instruction
– {f1mul vr3,vr2,vr1; nop} (Slice 0): DP multiply of an element from vector register 2 with an element from vector register 1; result in vector register 3; register pointers are bumped to the next element in each vector
– {x2add vr5,vr7,vr9; nop} (Slice 1): 32-bit integer add of 2 elements from vector register 7 to 2 elements from vector register 9; results in vector register 5; register pointers are bumped to the next elements in each vector

The same operations can be applied to registers in the scalar register file; in that case the compiler cannot place a dependent operation in the next instruction:

[1] nop {f1mul sr3,sr2,sr1; nop} {x2add sr5,sr7,sr9; nop} {nop; nop} {nop; nop}
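As a rough illustration, the C loop below sketches what the vector instruction above computes; the array views of the registers, their lengths, and the element ordering are assumptions for illustration, not AMC definitions.

    #include <stdint.h>

    void lane_instruction_semantics(double vr1[32], double vr2[32], double vr3[32],
                                    int32_t vr7[64], int32_t vr9[64], int32_t vr5[64])
    {
        for (int rep = 0; rep < 32; rep++) {          /* Rep field: [32] */
            /* Slice 0, f1mul vr3,vr2,vr1: one DP multiply per repetition,
               with register pointers bumped to the next element */
            vr3[rep] = vr2[rep] * vr1[rep];
            /* Slice 1, x2add vr5,vr7,vr9: two 32-bit integer adds per repetition */
            vr5[2*rep]     = vr7[2*rep]     + vr9[2*rep];
            vr5[2*rep + 1] = vr7[2*rep + 1] + vr9[2*rep + 1];
        }
    }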
LSU Commands, including strides, gather, scatter
The LS slot of each slice issues load/store unit commands:

[32] nop {nop; l8u vr3,sr2,sr1} {nop; l8 vr5,vr7,sr9} {nop; nop} {nop; nop}

– [32] nop: perform the operation on the next 32 elements
– {nop; l8u vr3,sr2,sr1} (Slice 0), a strided load: load a DP element into vector register 3 from the address obtained by adding scalar register 2 to scalar register 1; the next load will use the updated value of sr1
– {nop; l8 vr5,vr7,sr9} (Slice 1), a gather: load a DP element into vector register 5 from the address given by an element of vector register 7 incremented by the scalar value in sr9; the vector register pointer is bumped to the element that provides the next address
Bandwidth and Latency (Sequential Access)
Figure from [IBM JRD]
Improvement with Open-Page Mode
Figure from [IBM JRD]
Coherence
No caches within AMC
Granularity of transfer to lane – 8 bytes to 256 bytes
Coherence bit checked on access
– Overhead only on small-granularity stores
DRAM page: 1024 bytes + 1024 bits of metadata
Line: 128 bytes + 128 bits of metadata
Vault block: 32 bytes + 32 bits of metadata (2 bits coherence, 30 bits ECC + other)
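A hypothetical C view of the 32 metadata bits accompanying each 32-byte vault block; only the 2/30 split comes from the slide, while the field names and bit ordering are illustrative.

    #include <stdint.h>

    struct vault_block_metadata {
        uint32_t coherence : 2;    /* coherence bits checked on access */
        uint32_t ecc_other : 30;   /* ECC + other bits */
    };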
AMC API
Goal:
– Expose all hardware resources of interest to the user
– Allow efficient sharing of AMC execution units
– Allow efficient user-mode access
Logical naming of resources; physical mapping by the OS
– The OS is a minimal kernel, the Compute Node Kernel (CNK) employed on BG/Q
Memory placement: allocation and relocation of memory
AMC Interface Segment – for lane code and initialized data
Data areas (with addresses in special-purpose registers)
– Common Area: for data common to all lanes
– Lane Stack: for data specific to a single lane
Lane Completion Mechanism (sketched below)
– Atomic fetch-and-decrement
– Wakeup of the host
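A minimal sketch of the completion idea, assuming each lane atomically decrements a shared counter when it finishes and the last one notifies the host; lane_done and wake_host are hypothetical names, not AMC API calls.

    #include <stdatomic.h>

    extern void wake_host(void);     /* hypothetical host-notification hook */

    atomic_int lanes_remaining;      /* initialized to the number of active lanes */

    void lane_done(void)
    {
        /* atomic fetch-and-decrement; the lane that takes the count
           from 1 to 0 is the last to finish and wakes the host */
        if (atomic_fetch_sub(&lanes_remaining, 1) == 1)
            wake_host();
    }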
AMC Simulator
Supports all features of the Instruction Set Architecture
Timing-correct at the AMC level, including internal interconnect and vaults
Functionally correct at the host level, including privileged mode
OS and runtime accurately simulated
Provides support for power and reliability calculation
Provides support for experimenting with different mixes of resources and with new features
Compiling to the AMC
[Figure: compilation flow; a user-annotated program (OpenMP 4.0) goes through the XL Compiler and LLVM to a binary, and user-written vector ISA code goes through the Assembler to a binary]
AMC Compiler
Supports C, C++, and Fortran front ends
User annotates the program using OpenMP 4.0 accelerator directives (a sketch follows this list)
– Identify code section to run on AMC
– Identify data region accessed by AMC
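A minimal sketch of this annotation style using standard OpenMP 4.0 target directives; the DAXPY kernel here is illustrative, not taken from the slides.

    void daxpy(int n, double a, double *x, double *y)
    {
        /* "target" identifies the code section to run on the accelerator;
           "map" identifies the data regions it reads and writes */
        #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }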
Compiled Applications
DGEMM
DAXPY
Determinant
LULESH
SNAP
Nekbone
UMT
CoMD
Challenges/Opportunities for Compiler Exploitation
Heterogeneity
Programmable length vectors
Gather-scatter
Slice-mapping
Predication using mask registers
Scalar-vector combination code
Memory latency hiding
Instruction Buffer size
Techniques Used
Polyhedral framework
Loop nests analyzed for vector exploitation
– Loop blocking, loop versioning, and loop unrolling applied in an integrated manner (see the blocking sketch after this list)
Software I-cache
Limited source code transformation
– LULESH Kernel 1: distribute the kernel into two loops to help scheduling
– Nekbone: manually inline multiply-nested procedure calls
– These transformations will be automated eventually
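A generic sketch of loop blocking with a versioned residual loop, assuming a hardware vector length of 32 as in the earlier instruction examples; this is illustrative, not the XL compiler's actual output.

    /* block a loop into chunks of the vector length; full blocks can map to
       [32]-repeat vector instructions, the residue to a versioned scalar loop */
    void scaled_copy(int n, const double *a, double *b)
    {
        enum { VL = 32 };
        int i = 0;
        for (; i + VL <= n; i += VL)       /* blocked, vectorizable version */
            for (int j = 0; j < VL; j++)
                b[i + j] = 2.0 * a[i + j];
        for (; i < n; i++)                 /* residual version */
            b[i] = 2.0 * a[i];
    }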
Performance
DGEMM – C = C – A x B (reference loop sketched below)
– 83% of peak performance: 266 GF per AMC, hand assembled
  77% through the compiler
– 10 W power in 14 nm technology
– Roughly 20 GF/W at the system level
– 7 nm projection: 56 GF/W at the AMC
  Roughly at target at the system level
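For reference, a naive C sketch of the operation measured above (C = C – A x B), assuming row-major square matrices; this shows only the semantics, not the hand-scheduled AMC kernel.

    void dgemm_ref(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] -= A[i*n + k] * B[k*n + j];
    }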
Breakdown of Resources
[Figure: area breakdown and power breakdown of the AMC, from [IBM JRD]]
Bandwidth
DAXPY (Y = Y + a x X)
Performance depends on data striping
4.95 computations per cycle per AMC when striped across all lanes
– Out of 32 possible
– Low because of bandwidth limitation
Improves to 7.66 computations per cycle when data is blocked within a vault
Bandwidth utilized
– 153.2 GB/s (read), 76.6 GB/s (write) per AMC
– 2.4 TB/s (read), 1.2 TB/s (write) across the node
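This is consistent with the arithmetic of the kernel: each DAXPY element reads X and Y (16 bytes) and writes Y (8 bytes) for a single multiply-add, which matches the 2:1 read-to-write ratio above and leaves the lanes waiting on memory rather than on arithmetic.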
Power Distribution
Power Consumption Across Applications
Real applications tend to be less power-constrained
[Figure: AMC power in 14 nm technology (32 lanes), 0 to 12 W per kernel, broken down into links, clock grid, leakage, lanes, network, and DRAM, for the kernels dgemm_base, dgemm_atcomb, Nekbone_IP, Nekbone_DAXPY, Nekbone_AX_e, Lulesh1_VOL, COMD_LJ, and COMD_EAM]
Concluding Remarks
3D technology is reviving interest in Processing-in-Memory
The AMC design demonstrates a way to meet DoE Exascale requirements of 1 Exaflops in 20 MW
– An AMC-like PIM approach may be the only way to achieve this target in 7 nm CMOS technology
Widespread use of the AMC technology will depend on commodity adoption of such 3D memory+logic stacks
Thank you!
This work was supported in part by Department of Energy Contract B599858 under the FastForward initiative
– PIs: Ravi Nair and Jaime Moreno
Collaborators
– S. F. Antao, C. Bertolli, P. Bose, J. Brunheroto, T. Chen, C.-Y. Cher, C. H. A. Costa, J. Doi, C. Evangelinos, B. M. Fleischer, T. W. Fox, D. S. Gallo, L. Grinberg, J. A. Gunnels, A. C. Jacob, P. Jacob, H. M. Jacobson, T. Karkhanis, C. Kim, J. K. O'Brien, M. Ohmacht, Y. Park, D. A. Prener, B. S. Rosenburg, K. D. Ryu, O. Sallenave, M. J. Serrano, P. Siegl, K. Sugavanam, Z. Sura
Upcoming Paper
[IBM JRD] R. Nair et al., "Active Memory Cube: A Processing-in-Memory Architecture for Exascale Systems," IBM Journal of Research and Development, vol. 59, no. 2/3, 2015.
IBM, BG/Q, Blue Gene/Q and Active Memory Cube are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide.