Versatile Refresh: Low Complexity Refresh Scheduling for High-throughput Multi-banked eDRAM
Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu
(alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, [email protected]
ACM Sigmetrics/Performance 2012
What Is Embedded DRAM?
1. 2nd Most Common Embedded Memory
• Consists of 1 Transistor, 1 Capacitor cell
• 2X-3X denser than SRAM
• 2X-4X slower than SRAM
2. Supported by Key ASIC and IP Vendors
• IBM, TSMC, NEC, Mosys, ST
3. Used in a Number of Applications
• Servers, Networking, Storage, Gaming, Mobile
4. Industry Examples
• IBM's P7
• Sony Playstations, Nintendo GameCube, Wii
• Apple iPhone, Microsoft Zune HD, Xbox 360
• Cisco Catalyst 3K-10K
[Figure: eDRAM 1T1C memory cell, showing the Data line, the Select line, and the storage capacitor]
Problem: eDRAM Refresh Causes Memory Bandwidth Loss
[Figure: Bank 1 with R rows, an R/W port, and a refresh port]
DRAM Capacitor has Finite Retention Time (W = Tref)
Example: W = 18 us @ 100C = 4050 cycles @ 225 MHz
Example: R = 64 rows
Without refresh, all 64 rows will lose data in 4050 cycles!
Solution: Periodic Refresh. Reserve Refresh Cycles for Every Cell in Memory
Causes Bandwidth Loss = R/W = 64 rows / 4050 cycles ~ 1.58%
Trend: Higher Density Multi-banked Macros (Mb/mm2)
[Figure: B memory banks of R rows each, with shared refresh and R/W ports 1 to M]
(1) More Rows are Packed Together and Need to be Refreshed
(2) More Banks are Packed Together and Need to be Refreshed
(3) Smaller Capacitor with Lower Geometry → Smaller W
(4) Smaller W with Higher Temperature
(5) Low Clock Speed Mode Decreases ‘Clock-time’ to Refresh
Shared Circuitry to Conserve Area
Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB)
Does not Scale with Larger Macros, Geometry & Low Power Modes
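Examples of Periodic Refresh with Multi-banked Macros
M = 1 (Ports)
Mode (Clock) | W (@ 100C) | R (Rows) | B (Banks) | Periodic Refresh Loss
Normal (250 MHz) | 18.2 us = 4050 cycles | 64 | 8 | 12.6%
Normal (250 MHz) | 18.2 us = 4050 cycles | 128 | 8 | 25%
Low Power (150 MHz) | 18.2 us = 2699 cycles | 64 | 8 | 18.9%
Low Power (150 MHz) | 18.2 us = 2699 cycles | 128 | 8 | 38%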
M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time
The Problem is Only Getting Worse Over Time …
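The loss figures in the table follow from the RB/W estimate. The sketch below recomputes them from the cycle counts exactly as the table lists them; the function name `periodic_refresh_loss` is a hypothetical helper, not something from the talk. The results agree with the table up to rounding.

```python
def periodic_refresh_loss(rows: int, banks: int, window_cycles: int) -> float:
    """Fraction of cycles spent on periodic refresh, estimated as R*B / W."""
    if window_cycles < rows * banks:
        raise ValueError("W must be at least R*B for periodic refresh to keep up")
    return rows * banks / window_cycles


# (mode, R, B, W in cycles); W values are copied from the table as given.
configs = [
    ("Normal", 64, 8, 4050),
    ("Normal", 128, 8, 4050),
    ("Low Power", 64, 8, 2699),
    ("Low Power", 128, 8, 2699),
]

for mode, rows, banks, window in configs:
    loss = periodic_refresh_loss(rows, banks, window)
    print(f"{mode:9s} R={rows:3d} B={banks} W={window}: {loss:.1%}")
```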
Vendor Solution: Concurrent Refresh
[Figure: B memory banks of R rows each, with R/W ports 1 to M and a separate concurrent refresh port]
Concurrent Refresh++: Refresh a Bank Which is Not Being Concurrently Accessed
++T. Kirihata et al., An 800-MHz embedded DRAM with a concurrent refresh mode. Solid-State Circuits, IEEE Journal of, 40(6):1377–1387, June 2005.
How is Concurrent Refresh Used Today?
[Figure: memory banks 1, 2, ..., B with row pointers (RP); the next concurrent refresh pointer (Bank 2); the accessed bank; and the deficit register (count 3)]
Deficit Register Tracks Non-refreshed Bank(s)
Standard Observation: N-1 out of N Banks Get Refreshed for Any Pattern
Concurrent Refresh Overhead is Proportional to 1 Bank
Concurrent Refresh Overhead = R/W = 64 rows / 4050 cycles = ~1.58%
Our Observation: This is Incorrect! In Some Cases, Refresh Overhead can be Very Bad, Close to 100% for ANY Concurrent Refresh Scheduler
Goals of Our Work: An Industry Outlook
Design a Concurrent Refresh Scheduler that can
1. Provide Deterministic Memory Performance Guarantees
− Maximize Memory Throughput (Optimality)
2. Be Universally Applicable
− For any eDRAM macro with B banks, R Rows, M memory ports
− For any characteristics of cell retention time W++, and Clock speed
3. Maximize Memory Burst Tolerance
4. Have Low Implementation Overhead
++Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM
Problem Formulation
We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.
Fixed TDM Constraint
[Figure: timeline divided into consecutive refresh windows 1, 2, 3, 4, ..., with refresh slots in each window]
Sliding Window Constraint
Supports X idle cycles in any (t, t+Y)
[Figure: timeline with refresh slots, where any interval of length Y is a refresh window]
Sliding Window Constraint Gives Maximum Flexibility for Handling Bursts, and When to Provide Idle Cycles
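The difference between the two constraints can be checked on a concrete idle pattern. This is a minimal sketch, assuming a finite trace represented as a list of booleans (True for an idle slot); the function names are hypothetical.

```python
from typing import Sequence


def satisfies_fixed_tdm(idle: Sequence[bool], x: int, y: int) -> bool:
    """True if every aligned window [k*y, (k+1)*y) has at least x idle slots."""
    starts = range(0, len(idle) - y + 1, y)
    return all(sum(idle[s:s + y]) >= x for s in starts)


def satisfies_sliding_window(idle: Sequence[bool], x: int, y: int) -> bool:
    """True if every window [t, t+y) has at least x idle slots."""
    starts = range(len(idle) - y + 1)
    return all(sum(idle[t:t + y]) >= x for t in starts)


# One idle at the start of the first window, one at the end of the second.
pattern = [True, False, False, False, False, False, False, True]

print(satisfies_fixed_tdm(pattern, x=1, y=4))
print(satisfies_sliding_window(pattern, x=1, y=4))
```

The pattern passes the fixed TDM check but fails the sliding window check, because slots 1 to 4 form a window of length 4 with no idle slot.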
Key Performance Metrics
Refresh Overhead = X / Y
• Memory bandwidth wasted on refresh
Burst Tolerance = Y – X
• Maximum number of consecutive memory accesses without interruption for refresh
We’ll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1
Our Solution: Versatile Refresh Algorithm
[Figure: memory banks 1, 2, ..., B with row pointers RP1, RP2, RP3, RP4, ..., RPB; the next concurrent refresh pointer; a deficit register with Pointer and Count fields; and a Max Register holding a count of 3]
Bank with deficit has priority for refresh.
Maximum Allowed Deficit Register Controls Burst Tolerance (Y)
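The slide gives only the outline of the algorithm, so the following is a simplified sketch of a deficit-based concurrent refresh scheduler for M = 1, not the authors' implementation. It assumes a round-robin bank pointer, one row pointer per bank, and a single deficit register that records refreshes missed by a blocked bank; the class name, the method names, and the choice to raise an error when the deficit register is full are all illustrative assumptions.

```python
from typing import List, Optional, Tuple


class DeficitRefreshScheduler:
    """Toy single-port (M = 1) refresh scheduler with one deficit register."""

    def __init__(self, num_banks: int, num_rows: int, max_deficit: int) -> None:
        if num_banks < 2:
            raise ValueError("concurrent refresh needs at least two banks")
        self.num_banks = num_banks
        self.num_rows = num_rows
        self.max_deficit = max_deficit
        self.row_ptr: List[int] = [0] * num_banks  # one row pointer per bank
        self.next_bank = 0                          # round-robin refresh pointer
        self.deficit_bank: Optional[int] = None     # deficit register: pointer
        self.deficit_count = 0                      # deficit register: count

    def _refresh(self, bank: int) -> Tuple[int, int]:
        row = self.row_ptr[bank]
        self.row_ptr[bank] = (row + 1) % self.num_rows
        return bank, row

    def step(self, accessed: Optional[int]) -> Tuple[int, int]:
        """Advance one timeslot. `accessed` is the bank in use, or None if idle.

        Returns the (bank, row) refreshed in this timeslot.
        """
        # A bank with a deficit has priority whenever it is not being accessed.
        if self.deficit_count > 0 and accessed != self.deficit_bank:
            bank = self.deficit_bank
            self.deficit_count -= 1
            if self.deficit_count == 0:
                self.deficit_bank = None
            return self._refresh(bank)

        bank = self.next_bank
        self.next_bank = (bank + 1) % self.num_banks
        if bank == accessed:
            # Blocked: record the missed refresh, then refresh the next bank.
            if self.deficit_count >= self.max_deficit:
                raise RuntimeError("deficit register full: an idle slot was required")
            self.deficit_bank = bank
            self.deficit_count += 1
            bank = self.next_bank
            self.next_bank = (bank + 1) % self.num_banks
        return self._refresh(bank)


sched = DeficitRefreshScheduler(num_banks=4, num_rows=2, max_deficit=3)
# Four accesses to bank 0, then three idle slots that pay back the deficit.
trace = [sched.step(0) for _ in range(4)] + [sched.step(None) for _ in range(3)]
print(trace)
```

In this sketch a larger `max_deficit` lets the user keep accessing one bank for longer before an idle slot becomes mandatory, which is the sense in which the maximum deficit controls burst tolerance.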
Necessary Refresh Overhead for any Algorithm: Intuition, X=1
At each time the BR memory cells have distinct ages ≥ (0, …, BR-1)
An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.
A total of BR inequalities to ensure cells are refreshed in time
Interestingly, only two of these inequalities matter
• The one corresponding to the oldest cell
• The one corresponding to the oldest “youngest cell in each bank”
Necessary Refresh Overhead for any Algorithm: Derivation, X=1
How much can the adversary age the oldest cell?
• Current age is at least BR-1
• Must wait for at least 1 idle before it is picked up: (BR-1) + Y ≤ W
$$OV_{nec} = \frac{1}{Y} \ge \frac{1}{W - BR + 1}$$
How much can the adversary age the oldest “youngest cell in each bank”?
• Current age is at least B-1
• Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W
$$OV_{nec} = \frac{1}{Y} \ge \frac{R}{W - (B - 1)}$$
Optimality for Versatile Refresh Overhead: Results, X =1
Necessity: Result for any Algorithm
$$OV_{nec} \ge \max\left(\frac{1}{W - BR + 1},\ \frac{R}{W - (B - 1)}\right) \approx \frac{R}{W}$$
Sufficiency: Result for VR Algorithm (with parameter X):
$$OV_{suf} \le \max\left(\frac{X}{W - BR},\ \frac{R}{W - BX - 1}\right)$$
Nearly Optimal Refresh For X=1
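To see how close the two bounds are for X = 1, the sketch below evaluates the necessary bound max(1/(W-BR+1), R/(W-(B-1))) and the sufficient bound max(X/(W-BR), R/(W-BX-1)) for R = 64, B = 8, W = 4050, values borrowed from the earlier periodic refresh example. The function names are hypothetical.

```python
def necessary_overhead(rows: int, banks: int, window: int) -> float:
    """Lower bound on refresh overhead for any algorithm with X = 1."""
    if window <= banks * rows - 1:
        raise ValueError("W is too small for any schedule")
    oldest_cell = 1 / (window - banks * rows + 1)
    oldest_youngest = rows / (window - (banks - 1))
    return max(oldest_cell, oldest_youngest)


def sufficient_overhead(rows: int, banks: int, window: int, x: int = 1) -> float:
    """Upper bound on refresh overhead for VR with parameter X."""
    if window <= banks * rows or window <= banks * x + 1:
        raise ValueError("W is too small for this bound")
    return max(x / (window - banks * rows), rows / (window - banks * x - 1))


rows, banks, window = 64, 8, 4050
nec = necessary_overhead(rows, banks, window)
suf = sufficient_overhead(rows, banks, window, x=1)
print(f"necessary  >= {nec:.4%}")
print(f"sufficient <= {suf:.4%}")
print(f"R/W         = {rows / window:.4%}")
```

For these values the bounds differ by less than 0.001 percentage points, and both are close to R/W.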
Performance Guarantees of Versatile Refresh Algorithm
[Figure: worst-case refresh overhead (X/Y) versus cell retention time (W). W-axis marks: RB and Wc = RB + B-1. Overhead-axis marks: 0, R/W, 1/B, 1. Curves shown for increasing X.]
“Bad” Region with High Overhead
Near-optimal Refresh Overhead for X = 1
Refresh Overhead ~ R/W, for W large
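The mark Wc = RB + B-1 appears to be the retention time at which the two terms of the X = 1 necessary bound, 1/(W-BR+1) and R/(W-(B-1)), are equal, with both equal to 1/B. The sketch below checks this with exact rational arithmetic for one illustrative choice of R and B; the function name is hypothetical.

```python
from fractions import Fraction


def crossover_window(rows: int, banks: int) -> int:
    """Wc = RB + B - 1, as marked on the W axis of the figure."""
    return rows * banks + banks - 1


rows, banks = 64, 8
wc = crossover_window(rows, banks)

# The two terms of the necessary bound for X = 1, evaluated at W = Wc.
oldest_cell = Fraction(1, wc - banks * rows + 1)
oldest_youngest = Fraction(rows, wc - (banks - 1))

print(wc, oldest_cell, oldest_youngest)
```

Below Wc the first term dominates and grows to 1 as W approaches RB, which matches the region labeled as having high overhead.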
Why Would We Ever Use Large X?
Because of Burst Tolerance (large X → large Y – X)
• If memory accesses are bursty, refreshes can be hidden
There is a Critical Value of X for Max Burst Tolerance
$$X_c = \min\left(R,\ \left\lceil \frac{W}{B} \right\rceil - R\right)$$
Example: B = 16, R = 128, W = 2500
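For the example values, the critical X can be computed directly from X_c = min(R, ceil(W/B) - R). The sketch below also runs a brute-force check under one reading of the sufficiency bound OV_suf ≤ max(X/(W-BR), R/(W-BX-1)): that the largest usable Y for a given X is min(W-BR, floor(X(W-BX-1)/R)). That reading, and both function names, are assumptions made for illustration.

```python
def critical_x(rows: int, banks: int, window: int) -> int:
    """X_c = min(R, ceil(W / B) - R), using integer ceiling division."""
    return min(rows, -(-window // banks) - rows)


def burst_tolerance(rows: int, banks: int, window: int, x: int) -> int:
    """Y - X, with Y assumed to be the largest value the sufficiency bound allows."""
    y = min(window - banks * rows, x * (window - banks * x - 1) // rows)
    return y - x


rows, banks, window = 128, 16, 2500
xc = critical_x(rows, banks, window)
best_x = max(range(1, rows + 1),
             key=lambda x: burst_tolerance(rows, banks, window, x))

print("X_c from the formula:", xc)
print("X with largest Y - X by brute force:", best_x)
print("burst tolerance at X_c:", burst_tolerance(rows, banks, window, xc))
```

Under this reading the brute-force search and the closed form pick the same X for the example.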
Calculations for Customer ASIC++
Quantity | Theoretical | In Practice
Total Bandwidth | 375 MHz | 375 MHz
Versatile Refresh Formula | 6825 > 257N | 6825 > 257N
Versatile Refresh Constraint | 1 in 26.55 | 1 in 26
Data-path | 360 MHz | 360 MHz
Refresh | 14.12 MHz | 14.42 MHz
Extra Bandwidth for CPU | 0.88 MHz | 0.58 MHz
R = 1024, B = 6 Banks, W = 18.2 us @ 100C = 6825 cycles @ 375 MHz
++Note that these numbers have been sanitized
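The arithmetic behind the table can be reproduced as follows. This sketch assumes that N in "6825 > 257N" is the number of cycles per required idle slot, and it takes the coefficient 257 from the slide as given, since the slide does not show how it was derived. The helper name `budget` is hypothetical.

```python
import math

total_mhz = 375.0
datapath_mhz = 360.0
window_cycles = 18200 * 375 // 1000  # 18.2 us at 375 MHz, in cycles
coefficient = 257                     # taken from the slide's "6825 > 257N"


def budget(cycles_per_idle: float) -> tuple:
    """Return (refresh bandwidth, bandwidth left for the CPU), both in MHz."""
    refresh = total_mhz / cycles_per_idle
    return refresh, total_mhz - datapath_mhz - refresh


n_theory = window_cycles / coefficient  # largest N allowed by the formula
n_practice = math.floor(n_theory)       # an integer N for real hardware

for label, n in (("Theoretical", n_theory), ("In practice", n_practice)):
    refresh, extra = budget(n)
    print(f"{label}: 1 idle in {n:.2f} cycles, "
          f"refresh {refresh:.2f} MHz, extra {extra:.2f} MHz")
```

Rounding N down to an integer costs about 0.3 MHz of bandwidth in this example, which is the gap between the two columns.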
Versatile Refresh Enhancement
Enhancement:
• No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.
• Any idle slot is a no-conflict slot, but not vice versa
• For VR, no-conflict slots are as good as idle slots.
Observation:
• This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns
Fully Enhanced Versatile Refresh Algorithm
[Figure: memory banks 1, 2, ..., B with row pointers RP1, RP2, RP3, RP4, ..., RPB; the next refresh bank pointer; a deficit register with Pointer and Count fields; a Max Register holding a count of 3; and an Enforcer Module (user logic) that provides X idles in Y timeslots and receives no-conflict feedback]
Repeat for Multiple Memory Ports (M)
Simulation: Synthetic Statistical Workload
Parameter Alpha Controls Degree of Temporal Locality
• alpha ~ 0 → always read from bank 1 (adversarial)
• alpha ~ 1 → read from random banks (benign)
[Figure: refresh overhead (%), on a scale of 0 to 9, versus alpha (0.001, 0.01, 0.1, 0.25, 0.5, 1) for Periodic Refresh, VR (X = 4), and VR (X = 128)]
VR with X = 4: Min worst-case overhead (best for adversarial)
VR with X = 128: Max burst tolerance (best for benign)
Refresh Overhead has Disappeared Completely!
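The slides do not define the workload model beyond the two limiting cases of alpha, so the generator below is one possible model, stated as an assumption: at each timeslot, with probability alpha the reader jumps to a uniformly random bank, and otherwise it stays on the current bank. Bank index 0 stands in for the slide's "bank 1". Both function names are hypothetical.

```python
import random
from typing import List


def synthetic_accesses(num_banks: int, length: int, alpha: float,
                       seed: int = 0) -> List[int]:
    """Generate a bank-access trace whose temporal locality is set by alpha."""
    rng = random.Random(seed)
    bank = 0  # start on the first bank
    trace = []
    for _ in range(length):
        if rng.random() < alpha:
            bank = rng.randrange(num_banks)  # jump to a random bank
        trace.append(bank)
    return trace


def repeat_fraction(trace: List[int]) -> float:
    """Fraction of timeslots that access the same bank as the previous slot."""
    repeats = sum(1 for prev, cur in zip(trace, trace[1:]) if prev == cur)
    return repeats / (len(trace) - 1)


for alpha in (0.0, 0.01, 0.5, 1.0):
    trace = synthetic_accesses(num_banks=8, length=10_000, alpha=alpha)
    print(f"alpha={alpha}: repeat fraction {repeat_fraction(trace):.3f}")
```

With alpha = 0 every access goes to the same bank, which is the adversarial pattern from the necessity argument; larger alpha spreads accesses across banks and leaves more no-conflict slots for the scheduler.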
Conclusion
With Versatile Refresh A Designer Can …
1. Exactly Calculate Available Memory Bandwidth
− For any eDRAM macro with B banks, R Rows, M memory ports
− For any characteristics of Temperature, W = Tref and Clock speed
2. Achieve Optimal Worst-case Memory Bandwidth
3. Design for Large Burst Tolerance
4. Potentially Eliminate Back-pressure
− Simplify associated complex design and verification
5. Maximize Best-case Memory Bandwidth
6. Avail of a Formally Verified VR Controller
− On a suitably reduced memory instance