Versatile Refresh: Low Complexity Refresh Scheduling for High-throughput Multi-banked eDRAM

Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu

(alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, [email protected]

ACM Sigmetrics/Performance 2012


What Is Embedded DRAM?

1. 2nd Most Common Embedded Memory
• Consists of 1 Transistor, 1 Capacitor cell
• 2X-3X denser than SRAM
• 2X-4X slower than SRAM

2. Supported by Key ASIC and IP Vendors
• IBM, TSMC, NEC, Mosys, ST

3. Used in a Number of Applications
• Servers, Networking, Storage, Gaming, Mobile

4. Industry Examples
• IBM's P7
• Sony Playstations, Nintendo GameCube, Wii
• Apple iPhone, Microsoft Zune HD, Xbox 360
• Cisco Catalyst 3K-10K

[Figure: eDRAM 1T1C memory cell, showing the Data line, the Select line, and the storage capacitor]


[Figure: Bank 1 with R rows, an R/W port, and a refresh port]

Problem: eDRAM Refresh Causes Memory Bandwidth Loss

DRAM Capacitor has Finite Retention Time (W = Tref)

Example: W = 18 us @ 100C = 4050 cycles @ 225 MHz

Example: R = 64 rows

All 64 rows will lose data in 4050 cycles!

Solution: Periodic Refresh (Reserve Refresh Cycles for Every Cell in Memory)

Causes Bandwidth Loss = R/W = 64 rows / 4050 cycles ≈ 1.58%
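The arithmetic on this slide can be checked with a short Python sketch. The function names are illustrative and do not come from the slides; the inputs are the slide's example values (W = 18 us, 225 MHz clock, R = 64 rows).

```python
def retention_cycles(retention_us: float, clock_mhz: float) -> int:
    """Convert a retention time in microseconds to clock cycles (us * MHz = cycles)."""
    return round(retention_us * clock_mhz)


def periodic_refresh_loss(rows: int, window_cycles: int) -> float:
    """Fraction of cycles reserved when every row is refreshed once per window W."""
    return rows / window_cycles


W = retention_cycles(18, 225)        # 18 us at 225 MHz
loss = periodic_refresh_loss(64, W)  # R / W
print(f"W = {W} cycles, loss = {loss:.2%}")
```

With these inputs the sketch gives W = 4050 cycles and a loss of about 1.58%, matching the figures quoted on the slide.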


Trend: Higher Density Multi-banked Macros (Mb/mm2)

[Figure: a macro of memory banks 1..B, each with rows 1..R, sharing refresh and R/W ports 1..M]

(1) More Rows are Packed Together and Need to be Refreshed

(2) More Banks are Packed Together and Need to be Refreshed

(3) Smaller Capacitor with Lower Geometry → Smaller W

(4) Smaller W with Higher Temperature

(5) Low Clock Speed Mode Decreases ‘Clock-time’ to Refresh

Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB)

Shared Circuitry to Conserve Area

Does not Scale with Larger Macros, Geometry & Low Power Modes


Examples of Periodic Refresh with Multi-banked Macros

M = 1 (Ports)

Mode                  W (@ 100C)              R (Rows)   B (Banks)   Periodic Refresh Loss
Normal (250 MHz)      18.2 us = 4050 cycles   64         8           12.6%
Normal (250 MHz)      18.2 us = 4050 cycles   128        8           25%
Low Power (150 MHz)   18.2 us = 2699 cycles   64         8           18.9%
Low Power (150 MHz)   18.2 us = 2699 cycles   128        8           38%

M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time
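As a rough check of the table, the sketch below evaluates the RB/W estimate for each listed configuration. The function name is hypothetical, and the cycle counts are taken directly from the table rather than recomputed from the clock speeds.

```python
def multibank_periodic_loss(rows: int, banks: int, window_cycles: int) -> float:
    """Periodic refresh loss ~ R*B / W when B banks of R rows share one refresh port."""
    if window_cycles < rows * banks:
        raise ValueError("W must be at least R*B, or some cell cannot be refreshed in time")
    return rows * banks / window_cycles


# (mode, W in cycles, R, B) as listed on the slide
configs = [
    ("Normal", 4050, 64, 8),
    ("Normal", 4050, 128, 8),
    ("Low Power", 2699, 64, 8),
    ("Low Power", 2699, 128, 8),
]
losses = [multibank_periodic_loss(r, b, w) for _, w, r, b in configs]
for (mode, w, r, b), loss in zip(configs, losses):
    print(f"{mode:9s} W={w} R={r} B={b} -> {loss:.1%}")
```

The computed values agree with the slide's percentages to within rounding.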

The Problem is Only Getting Worse Over Time …


Vendor Solution: Concurrent Refresh

[Figure: memory banks 1..B, each with rows 1..R, with R/W ports 1..M and a separate concurrent refresh port]

Concurrent Refresh++: Refresh a Bank Which is Not Being Concurrently Accessed

++T. Kirihata et al., An 800-MHz embedded DRAM with a concurrent refresh mode. Solid-State Circuits, IEEE Journal of, 40(6):1377–1387, June 2005.


How is Concurrent Refresh Used Today?

[Figure: memory banks 1..B with pointers RP1, RP2, RP3, RP4, ..., RP16; a next concurrent refresh pointer; an accessed bank; and a deficit register (Bank 2, Count 3) that tracks non-refreshed bank(s)]

Standard Observation: N-1 out of N Banks Get Refreshed for Any Pattern

Concurrent Refresh Overhead is Proportional to 1 Bank

Concurrent Refresh Overhead = R/W = 64 rows / 4050 cycles ≈ 1.58%

Our Observation: This is Incorrect! In Some Cases, Refresh Overhead can be Very Bad, Close to 100% for ANY Concurrent Refresh Scheduler


Goals of Our Work: An Industry Outlook

Design a Concurrent Refresh Scheduler that can

1. Provide Deterministic Memory Performance Guarantees
− Maximize Memory Throughput (Optimality)

2. Be Universally Applicable
− For any eDRAM macro with B banks, R Rows, M memory ports
− For any characteristics of cell retention time W++, and Clock speed

3. Maximize Memory Burst Tolerance

4. Have Low Implementation Overhead

++Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM


Problem Formulation

We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.

Fixed TDM Constraint

[Figure: timeline divided into Refresh Window 1, Refresh Window 2, Refresh Window 3, Refresh Window 4, ..., with refresh slots in each window]

Sliding Window Constraint: Supports X idle cycles in any (t, t+Y)

[Figure: timeline with refresh slots falling within any refresh window]

Sliding Window Constraint Gives Maximum Flexibility for Handling Bursts, and When to Provide Idle Cycles
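To make the sliding window constraint concrete, here is a minimal Python sketch that checks whether a sequence of timeslots provides at least X idle slots in every run of Y consecutive slots. The function name and the boolean encoding of a trace are assumptions made for illustration.

```python
from typing import Sequence


def satisfies_sliding_window(idle: Sequence[bool], x: int, y: int) -> bool:
    """Return True if every run of y consecutive timeslots has at least x idle slots.

    idle[t] is True when timeslot t is idle (available for refresh).
    Sequences shorter than y contain no full window and trivially pass.
    """
    if y <= 0 or not 0 <= x <= y:
        raise ValueError("require 0 <= x <= y and y > 0")
    count = sum(idle[:y])  # idle slots in the first window
    if len(idle) >= y and count < x:
        return False
    for t in range(y, len(idle)):
        count += idle[t] - idle[t - y]  # slide the window one slot to the right
        if count < x:
            return False
    return True


# One idle slot at the end of every 4-slot frame.
tdm = [False, False, False, True] * 3
# Same number of idle slots, but slots 1..4 form a window with no idle slot.
bunched = [True, False, False, False, False, False,
           False, True, False, False, False, True]

print(satisfies_sliding_window(tdm, x=1, y=4))      # True
print(satisfies_sliding_window(bunched, x=1, y=4))  # False
```

The second trace has as many idle slots as the first, which shows that the constraint depends on where the idle slots fall and not only on how many there are.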


Key Performance Metrics

Refresh Overhead = X / Y
• Memory bandwidth wasted on refresh

Burst Tolerance = Y - X
• Maximum number of consecutive memory accesses without interruption for refresh

We’ll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1
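The two metrics can be written down directly. The sketch below uses a hypothetical `RefreshConstraint` class and illustrative values of X and Y that were chosen only to show that two constraints with the same overhead can differ in burst tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefreshConstraint:
    """X idle (refresh) timeslots required in every Y consecutive timeslots."""
    x: int
    y: int

    @property
    def overhead(self) -> float:
        """Refresh overhead = X / Y."""
        return self.x / self.y

    @property
    def burst_tolerance(self) -> int:
        """Burst tolerance = Y - X."""
        return self.y - self.x


small = RefreshConstraint(x=1, y=26)   # illustrative values
large = RefreshConstraint(x=4, y=104)  # same overhead, larger window
print(small.overhead, small.burst_tolerance)
print(large.overhead, large.burst_tolerance)
```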


Our Solution: Versatile Refresh Algorithm

[Figure: memory banks 1..B with pointers RP1, RP2, RP3, RP4, ..., RPB; a next concurrent refresh pointer; a deficit register (Pointer, Count); and a max register (Count)]

Bank with deficit has priority for refresh.

Maximum Allowed Deficit Register Controls Burst Tolerance (Y)
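The slide gives only the outline of the scheduler, so the following is a toy model and not the authors' exact algorithm. It assumes a single R/W port, a round-robin bank pointer, one deficit register (bank and count), and a cap `max_deficit` on the deficit count; the class name, method names, and the behaviour when the deficit is full are all assumptions.

```python
from typing import List, Optional


class ToyVersatileRefresh:
    """Toy single-port scheduler: round-robin refresh plus one deficit register."""

    def __init__(self, banks: int, rows: int, max_deficit: int) -> None:
        if banks < 2:
            raise ValueError("need at least two banks for concurrent refresh")
        self.banks, self.rows, self.max_deficit = banks, rows, max_deficit
        self.row_ptr: List[int] = [0] * banks  # next row to refresh in each bank
        self.next_bank = 0                     # round-robin bank pointer
        self.deficit_bank: Optional[int] = None
        self.deficit_count = 0

    def _refresh(self, bank: int) -> int:
        self.row_ptr[bank] = (self.row_ptr[bank] + 1) % self.rows
        return bank

    def step(self, accessed: Optional[int]) -> Optional[int]:
        """Advance one timeslot; `accessed` is the bank in use (None = idle slot).

        Returns the bank refreshed in this slot, or None if no refresh happened.
        """
        # 1. A bank with a deficit has priority as soon as it is not being accessed.
        if self.deficit_count > 0 and accessed != self.deficit_bank:
            self.deficit_count -= 1
            return self._refresh(self.deficit_bank)
        # 2. Otherwise follow round-robin order, skipping the accessed bank.
        if self.next_bank == accessed:
            if self.deficit_count >= self.max_deficit:
                return None  # assumed behaviour: deficit is full, an idle slot is needed
            self.deficit_bank = accessed
            self.deficit_count += 1
            self.next_bank = (self.next_bank + 1) % self.banks
        bank = self.next_bank
        self.next_bank = (self.next_bank + 1) % self.banks
        return self._refresh(bank)


vr = ToyVersatileRefresh(banks=3, rows=4, max_deficit=2)
trace = [0, 0, 0, 0, None, None]  # four accesses to bank 0, then two idle slots
refreshed = [vr.step(a) for a in trace]
print(refreshed)  # [1, 2, 1, 2, 0, 0]
```

In this trace bank 0 is skipped twice while it is being accessed, and the two idle slots at the end pay back its deficit, so every bank ends up with the same number of refreshed rows.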


Necessary Refresh Overhead for any Algorithm: Intuition, X=1

At each time the BR memory cells have distinct ages ≥ (0, …, BR-1)

An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.

A total of BR inequalities to ensure cells are refreshed in time

Interestingly, only two of these inequalities matter
• The one corresponding to the oldest cell
• The one corresponding to the oldest “youngest cell in each bank”


Necessary Refresh Overhead for any Algorithm: Derivation, X=1

How much can the adversary age the oldest cell?
• Current age is at least BR-1
• Must wait for at least 1 idle before it is picked up: (BR-1) + Y ≤ W

$$OV_{nec} = \frac{1}{Y} \ge \frac{1}{W - BR + 1}$$

How much can the adversary age the oldest “youngest cell in each bank”?
• Current age is at least B-1
• Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W

$$OV_{nec} = \frac{1}{Y} \ge \frac{R}{W - (B-1)}$$


Optimality for Versatile Refresh Overhead: Results, X =1

Necessity: Result for any Algorithm

$$OV_{nec} \ge \max\left(\frac{1}{W - BR + 1},\ \frac{R}{W - (B-1)}\right) \approx \frac{R}{W}$$

Sufficiency: Result for VR Algorithm (with parameter X):

$$OV_{suf} \le \max\left(\frac{X}{W - BR},\ \frac{R}{W - BX - 1}\right)$$

Nearly Optimal Refresh For X=1
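A small numerical sketch can show how close the two bounds are for X = 1. It assumes the bounds read OV_nec ≥ max(1/(W-BR+1), R/(W-(B-1))) and OV_suf ≤ max(X/(W-BR), R/(W-BX-1)), which is how the formulas on this slide are interpreted here; the function names are hypothetical, and the example values (W = 4050, B = 8, R = 64) are borrowed from the earlier periodic refresh table.

```python
def necessary_overhead(w: int, b: int, r: int) -> float:
    """Lower bound for any algorithm with X = 1: max(1/(W-BR+1), R/(W-(B-1)))."""
    if w < b * r:
        raise ValueError("require W >= B*R")
    return max(1 / (w - b * r + 1), r / (w - (b - 1)))


def vr_sufficient_overhead(w: int, b: int, r: int, x: int = 1) -> float:
    """Overhead assumed sufficient for VR: max(X/(W-BR), R/(W-BX-1))."""
    if w - b * r <= 0 or w - b * x - 1 <= 0:
        raise ValueError("bound is undefined for this W")
    return max(x / (w - b * r), r / (w - b * x - 1))


W, B, R = 4050, 8, 64
nec = necessary_overhead(W, B, R)
suf = vr_sufficient_overhead(W, B, R, x=1)
periodic = R * B / W
print(f"necessary {nec:.4%}, sufficient {suf:.4%}, periodic {periodic:.1%}")
```

Under these assumptions both bounds sit near R/W (about 1.58%), well below the periodic refresh loss of roughly 12.6% for the same macro.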


Performance Guarantees of Versatile Refresh Algorithm

[Figure: worst-case refresh overhead (X/Y) versus cell retention time (W). Y-axis marks: 0, R/W, 1/B, 1. X-axis marks: RB and Wc = RB + B-1. Annotations: “Bad” region with high overhead; near-optimal refresh overhead for X = 1; increasing X.]

Refresh Overhead ~ R/W, for W large


Why Would We Ever Use Large X?

Because of Burst Tolerance (large X → large Y - X)
• If memory accesses are bursty, refreshes can be hidden

There is a Critical Value of X for Max Burst Tolerance

$$X_c = \min\left(R,\ \left\lceil \frac{W}{B} \right\rceil - R\right)$$

Example: B = 16, R = 128, W = 2500
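For the example values on this slide, the critical X can be evaluated with a short sketch. It assumes the formula reads X_c = min(R, ⌈W/B⌉ - R); the function name is illustrative.

```python
def critical_x(w: int, b: int, r: int) -> int:
    """X_c = min(R, ceil(W / B) - R), using integer ceiling division."""
    ceil_w_over_b = -(-w // b)
    return min(r, ceil_w_over_b - r)


x_c = critical_x(w=2500, b=16, r=128)
print(x_c)  # 29
```

Here ⌈2500/16⌉ = 157, so X_c = min(128, 157 - 128) = 29 under the assumed formula.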


Calculations for Customer ASIC++

                               Theoretical    In Practice
Total Bandwidth                375 MHz        375 MHz
Versatile Refresh Formula      6825 > 257N    6825 > 257N
Versatile Refresh Constraint   1 in 26.55     1 in 26
Data-path                      360 MHz        360 MHz
Refresh                        14.12 MHz      14.42 MHz
Extra Bandwidth for CPU        0.88 MHz       0.58 MHz

R = 1024, B = 6 Banks, W = 18.2us @ 100C = 6825 cycles @ 375 MHz

++Note that these numbers have been sanitized
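The table's bandwidth split can be reproduced with a few lines of Python. The sketch assumes that "1 in N" means one refresh slot in every N timeslots, so refresh bandwidth is the total divided by N; the coefficient 257 is taken from the table as given, and all names are hypothetical.

```python
import math
from typing import Tuple

TOTAL_MHZ = 375.0     # total memory bandwidth
DATAPATH_MHZ = 360.0  # bandwidth required by the data path
W_CYCLES = 6825       # retention window in cycles
CYCLES_PER_N = 257    # coefficient from the constraint 6825 > 257N


def bandwidth_split(one_in_n: float) -> Tuple[float, float]:
    """Return (refresh MHz, leftover MHz) when 1 slot in every `one_in_n` is for refresh."""
    refresh = TOTAL_MHZ / one_in_n
    return refresh, TOTAL_MHZ - DATAPATH_MHZ - refresh


n_theory = W_CYCLES / CYCLES_PER_N  # largest N allowed by the formula, about 26.56
n_practice = math.floor(n_theory)   # a whole number of timeslots
theory = bandwidth_split(n_theory)
practice = bandwidth_split(n_practice)
print(f"theory:   refresh {theory[0]:.2f} MHz, extra {theory[1]:.2f} MHz")
print(f"practice: refresh {practice[0]:.2f} MHz, extra {practice[1]:.2f} MHz")
```

Rounding N down to a whole number of timeslots raises the refresh bandwidth slightly, which is why less extra bandwidth is left for the CPU in practice.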


Versatile Refresh Enhancement

Enhancement:
• No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.
• Any idle slot is a no-conflict slot; but not vice versa
• For VR, no-conflict slots are as good as idle slots.

Observation:
• This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns
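The distinction between idle and no-conflict slots can be sketched as two predicates over an access trace. The encoding (None for an idle slot, a bank index otherwise) and the fixed refresh target are simplifying assumptions; in the scheduler the target bank changes over time.

```python
from typing import List, Optional


def is_idle(accessed: Optional[int]) -> bool:
    """An idle slot is one in which no bank is accessed."""
    return accessed is None


def is_no_conflict(accessed: Optional[int], refresh_target: int) -> bool:
    """A no-conflict slot is one in which the bank to be refreshed is not accessed."""
    return accessed != refresh_target


trace: List[Optional[int]] = [0, 1, None, 2, 2, 0]  # accessed bank per timeslot
target = 2                                          # bank the scheduler wants to refresh

idle_slots = sum(is_idle(a) for a in trace)
no_conflict_slots = sum(is_no_conflict(a, target) for a in trace)
print(idle_slots, no_conflict_slots)  # 1 4
```

Only one slot in this trace is idle, yet four slots allow the target bank to be refreshed, which is where the lower overhead for non-adversarial patterns comes from.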


Fully Enhanced Versatile Refresh Algorithm

[Figure: memory banks 1..B with pointers RP1, RP2, RP3, RP4, ..., RPB; a next refresh bank pointer; a deficit register (Pointer, Count); and a max register (Count). An Enforcer Module (User Logic) provides X idles in Y timeslots and receives no-conflict feedback. Repeat for Multiple Memory Ports (M).]


Simulation: Synthetic Statistical Workload

Parameter Alpha Controls Degree of Temporal Locality
• alpha ~ 0 → always read from bank 1 (adversarial)
• alpha ~ 1 → read from random banks (benign)

[Figure: refresh overhead (%) versus alpha (0.001, 0.01, 0.1, 0.25, 0.5, 1), y-axis from 0 to 9, for Periodic Refresh, VR (X = 4), and VR (X = 128)]

VR with X = 4: Min worst-case overhead (best for adversarial)

VR with X = 128: Max burst tolerance (best for benign)

Refresh Overhead has Disappeared Completely!
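The slides do not spell out the workload model, so the generator below is one hypothetical way to obtain the two extremes described above: in each slot the bank is re-drawn uniformly at random with probability alpha, and otherwise the previous bank is read again. Banks are numbered from 0 here, so the slide's "bank 1" corresponds to index 0.

```python
import random
from typing import List


def synthetic_accesses(n: int, banks: int, alpha: float, seed: int = 0) -> List[int]:
    """Hypothetical workload: re-draw the bank with probability alpha, else repeat it."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible traces
    bank = 0                   # start at the first bank
    trace = []
    for _ in range(n):
        if rng.random() < alpha:
            bank = rng.randrange(banks)
        trace.append(bank)
    return trace


adversarial = synthetic_accesses(1000, banks=16, alpha=0.0)
benign = synthetic_accesses(1000, banks=16, alpha=1.0)
print(len(set(adversarial)), len(set(benign)))
```

With alpha = 0 every access goes to the same bank, and with alpha = 1 the accesses spread across the banks, matching the two ends of the plot's x-axis.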


Conclusion

With Versatile Refresh A Designer Can …

1. Exactly Calculate Available Memory Bandwidth
− For any eDRAM macro with B banks, R Rows, M memory ports
− For any characteristics of Temperature, W = Tref and Clock speed

2. Achieve Optimal Worst-case Memory Bandwidth

3. Design for Large Burst Tolerance

4. Potentially Eliminate Back-pressure
− Simplify associated complex design and verification

5. Maximize Best-case Memory Bandwidth

6. Avail of a Formally Verified VR Controller
− On a suitably reduced memory instance