View
4
Download
0
Category
Preview:
Citation preview
18-447 Computer Architecture
Lecture 30: In-memory Processing
Vivek Seshadri Carnegie Mellon University Spring 2015, 4/13/2015
Goals for This Lecture
• Understand DRAM technology – How it is built? – How it operates? – What are the trade-‐offs?
• Can we use DRAM for more than just storage? – In-‐DRAM copying – In-‐DRAM bitwise operaCons
2
DRAM Module and Chip
3
Goals of DRAM Design
• Cost • Latency • Bandwidth • Parallelism • Power • Energy • Reliability
4
DRAM Chip
5
Bank
DRAM Cell – Capacitor
6
Empty State Fully Charged State
Logical “0” Logical “1”
1
2
Small – Cannot drive circuits
Reading destroys the state
Sense Amplifier
7
enable
top
bo*om
Inverter
Sense Amplifier – Two Stable States
8
en en
0
0 VDD
VDD
Logical “1” Logical “0”
Sense Amplifier OperaMon
9
dis
VT
VB
VT > VB en
0
VDD
Capacitor to Sense Amplifier
10
en
0
VDD
en
VDD
0 ?
DRAM Cell OperaMon
11
½VDD
½VDD
dis en
0
VDD ½VDD+δ Cell loses charge
Cell regains charge
AmorMzing Cost – DRAM Tile
12
Row Driv
er
DRAM Subarray
13
Row Driv
er
Tile Tile Tile
Row Decod
er
DRAM Subarray
14
Row Driv
er
Tile Tile Tile
Row Decod
er
Tile Tile Tile Tile
DRAM Bank
15
Row Decod
er
Array of Sense Amplifiers (8Kb)
Cell Array
Cell Array
Row Decod
er
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O (64b)
Address
Address Data
DRAM Chip
16
Shared internal bus
Memory channel -‐ 8bits
Row Decoder
Array of Sense Amplifiers (8Kb)
Cell Array
Cell Array
Row Decoder
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O (64b)
Row Decoder
Array of Sense Amplifiers (8Kb)
Cell Array
Cell Array
Row Decoder
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O (64b)
Row Decoder
Array of Sense Amplifiers (8Kb)
Cell Array
Cell Array
Row Decoder
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O (64b)
Row Decoder
Array of Sense Amplifiers (8Kb)
Cell Array
Cell Array
Row Decoder
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O (64b)
Row Decoder
Array of Sen
se Amplifiers (8K
b)
Cell Array
Cell Array
Row Decoder
Array of Sen
se Amplifiers
Cell Array
Cell Array
Bank
I/O (6
4b)
Row Decoder
Array of Sen
se Amplifiers (8K
b)
Cell Array
Cell Array
Row Decoder
Array of Sen
se Amplifiers
Cell Array
Cell Array
Bank
I/O (6
4b)
Row Decoder
Array of Sen
se Amplifiers (8K
b)
Cell Array
Cell Array
Row Decoder
Array of Sen
se Amplifiers
Cell Array
Cell Array
Bank
I/O (6
4b)
Row Decoder
Array of Sen
se Amplifiers (8K
b)
Cell Array
Cell Array
Row Decoder
Array of Sen
se Amplifiers
Cell Array
Cell Array
Bank
I/O (6
4b)
DRAM OperaMon
17
Row Decod
er
Row Decod
er
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O Data
1
2
ACTIVATE Row
READ/WRITE Column
3 PRECHARGE
Row Add
ress
Column Address
Goals for This Lecture
• Understand DRAM technology – How it is built? – How it operates? – What are the trade-‐offs?
• Can we use DRAM for more than just storage? – In-‐DRAM copying – In-‐DRAM bitwise operaCons
18
Trade-‐offs in DRAM Design
• Cost • Latency • Bandwidth • Parallelism • Power • Energy • Reliability
19
— Rows/Subarray — Data width, Chips/DIMM — Banks/Chip
Goals for This Lecture
ü Understand DRAM technology – How it is built? – How it operates? – What are the trade-‐offs?
• Can we use DRAM for more than just storage? – In-‐DRAM copying – In-‐DRAM bitwise operaCons
20
RowClone Fast and Energy-‐Efficient In-‐DRAM Bulk Data Copy and IniMalizaMon
Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun,
G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry
Vivek Seshadri
Memory Channel – Bo\leneck
Core
Core
Cache
MC
Mem
ory
Channel
Limited Bandwidth
High Energy
Goal: Reduce Memory Bandwidth Demand
Core
Core
Cache
MC
Mem
ory
Channel
Reduce unnecessary data movement
Bulk Data Copy and IniMalizaMon
Bulk Data Copy
Bulk Data IniMalizaMon
src dst
dst val
Bulk Data Copy and IniMalizaMon
Bulk Data Copy
Bulk Data IniMalizaMon
src dst
dst val
Bulk Copy and IniMalizaMon – ApplicaMons
Forking
000000000000000
Zero iniMalizaMon (e.g., security)
VM Cloning DeduplicaMon
CheckpoinMng
Page MigraMon
Many more
Shortcomings of ExisMng Approach
Core
Core
Cache
MC Channel src
dst
High latency (1046ns to copy 4KB)
Interference
High Energy (3600nJ to copy 4KB)
Our Approach: In-‐DRAM Copy with Low Cost
Core
Core
Cache
MC Channel dst
High latency
Interference
High Energy
src
X
X
X
?
RowClone: In-‐DRAM Copy
29
Bulk Copy in DRAM – RowClone
30
½VDD
½VDD
0 1
0
VDD ½VDD +δ
Data gets copied
Fast Parallel Mode – Benefits
31
Latency Energy
Bulk Data Copy (4KB across a module)
1046ns to 90ns 3600nJ to 40nJ
No bandwidth consumpMon
Very li\le changes to the DRAM chip
11X 74X
Fast Parallel Mode – Constraints
• LocaCon constraint – Source and desCnaCon in same subarray
• Size constraint – EnCre row gets copied (no parCal copy)
32
1
2
Can sCll accelerate many exisCng primiCves (copy-‐on-‐write, bulk zeroing)
Alternate mechanism to copy data across banks (pipelined serial mode – lower benefits than Fast Parallel)
End-‐to-‐end System Design
• Soeware interface – memcpy and meminit instrucCons
• Managing cache coherence – Use exisCng DMA support!
• Maximizing use of Fast Parallel Mode – Smart OS page allocaCon
33
ApplicaMons Summary
34
0
0.2
0.4
0.6
0.8
1
bootup compile forkbench mcached mysql shell
Frac
tion
of M
emor
y Tr
affic
Zero Copy Write Read
Results Summary
35
0%
10%
20%
30%
40%
50%
60%
70%
bootup compile forkbench mcached mysql shell
Com
pare
d to
Bas
elin
e
IPC Improvement Memory Energy Reduction
Goals for This Lecture
ü Understand DRAM technology – How it is built? – How it operates? – What are the trade-‐offs?
• Can we use DRAM for more than just storage? – In-‐DRAM copying – In-‐DRAM bitwise operaCons
36
Triple Row AcMvaMon
37
½VDD
½VDD
dis
A
B
C
Final State AB + BC + AC
½VDD+δ
C(A + B) + ~C(AB) en
0
VDD
In-‐DRAM Bitwise AND/OR
Required OperaCon: Perform a bitwise AND of two rows A and B and store the result in C • R0 – reserved zero row, R1 – reserved one row • D1, D2, D3 – Designated rows for triple acCvaCon
1. RowClone A into D1 2. RowClone B into D2 3. RowClone R0 into D3 4. ACTIVATE D1,D2,D3 5. RowClone Result into C
38
Throughput Results
39
0 10 20 30 40 50 60 70 80 8K
B
16KB
32KB
64KB
128K
B
256K
B
512K
B
1MB
2MB
4MB
8MB
16MB
32MB AN
D/OR Th
roughp
ut (G
B/s)
Size of vectors involved in AND/OR
Intel-‐AVX (one core) Our Proposal (Aggressive) (one bank)
L1
L2 L3 Memory
Bitmap Index • AlternaCve to B-‐tree and its variants • Efficient for performing range queries and joins
40
Bitm
ap 1
Bitm
ap 2
Bitm
ap 4
Bitm
ap 3
age < 18 18 < age < 25 25 < age < 60 age > 60
Performance EvaluaMon
41
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4
3 9 20 45 98 118 128
Performan
ce RelaM
ve to
Ba
selin
e
Number of OR bins
ConservaMve (1 Bank) Aggressive (1 Bank)
ConservaMve (4 Banks) Aggressive (4 Banks)
Goals for This Lecture
ü Understand DRAM technology – How it is built? – How it operates? – What are the trade-‐offs?
ü Can we use DRAM for more than just storage? – In-‐DRAM copying – In-‐DRAM bitwise operaCons
42
Recommended