Buffer-On-Board Memory System 1 Name: Aurangozeb ISCA 2012

Buffer-On-Board Memory System

Name: Aurangozeb

ISCA 2012

Outline

• Introduction
• Modern Memory System
• Buffer-On-Board (BOB) Memory System
• BOB Simulation Suite
• BOB Simulation Result
  • Limit-Case Simulation
  • Full System Simulation
• Conclusion

Introduction (1/2)

• The memory system has been modified to cope with high-speed operation.
• Dual Inline Memory Module (DIMM): <100 MHz speed.
• Signal integrity issues (e.g., cross-talk, reflection) arise at high operating speeds.
  • Reducing the number of DIMMs allows a higher clock speed, but limits the total capacity.
• One simple solution: increase the capacity of a single DIMM.
  • Drawbacks: it is difficult to decrease the DRAM capacitor size, and cost does not scale linearly.

Introduction (2/2)

• FB-DIMM memory solution:
  • An Advanced Memory Buffer (AMB) is paired with DDRx DRAM to interpret the packetized protocol and issue DRAM-specific commands.
  • Supports both fast and slow speeds of operation.
  • Drawbacks:
    • The high-speed I/O of the AMB causes heat and power issues.
    • Not cost-effective.
• Solution from IBM / Intel / AMD:
  • A single logic chip, rather than one logic chip per FB-DIMM.
  • The chip controls the DRAM and communicates with the CPU over a relatively fast, narrow bus.
  • A new architecture using low-cost DIMMs.

Modern Memory System

• Considerations:
  • Ranks of memory per channel
  • DRAM type
  • Number of channels per processor

Buffer-On-Board (BOB) Memory System (1/2)

• Multiple BOB channels
• Each channel consists of LR-, R-, or U-DIMMs
• A single, simple controller for each channel
• A faster and narrower bus (the link bus) between the simple controller and the CPU

Buffer-On-Board (BOB) Memory System (2/2)

• Operation:
  • A request packet is sent over the link bus: address + request type + data (if a write).
  • The request is translated into DRAM-specific commands (ACTIVATE, READ, WRITE, etc.), which are issued to the DRAM ranks.
  • A command queue: dynamic scheduling.
  • A read return queue: sorting after the data is received.
  • A response packet contains the data + the address of the initial request.
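The request flow above can be sketched in a few lines of Python. This is only an illustration, not the authors' simulator: the names `Request`, `to_dram_commands`, and `make_response` are hypothetical, the address layout (column bits below row bits) is an assumed one, and the sketch assumes every access activates its row first.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Hypothetical field widths, chosen only for illustration.
COL_BITS = 10
ROW_BITS = 16


@dataclass
class Request:
    """Request packet as listed on the slide: address + type + data (if write)."""
    address: int
    is_write: bool
    data: Optional[bytes] = None


def to_dram_commands(req: Request) -> list:
    """Translate one request into DRAM commands.

    Assumption: every access activates its row first (no open-row reuse).
    """
    col = req.address & ((1 << COL_BITS) - 1)
    row = (req.address >> COL_BITS) & ((1 << ROW_BITS) - 1)
    access = "WRITE" if req.is_write else "READ"
    return [("ACTIVATE", row), (access, col)]


def make_response(req: Request, data: bytes) -> dict:
    """Response packet: data + address of the initial request."""
    return {"address": req.address, "data": data}


# Commands wait in a queue before being issued to the DRAM ranks.
command_queue = deque()
read_req = Request(address=(5 << COL_BITS) | 3, is_write=False)
command_queue.extend(to_dram_commands(read_req))
```

A real scheduler would reorder entries in `command_queue` dynamically; the sketch keeps them in arrival order for brevity.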

• BOB controller:
  • Address mapping
  • Returning data to the CPU/cache
  • Packetizing requests
  • Interpreting response packets: from and to the simple controller
  • Encapsulation: to support the narrower link bus
    • Uses multiple clock cycles to transmit the total data.
  • A cross-bar switch: any port to any link bus.
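The encapsulation point (a narrow link bus needs several clock cycles per packet) comes down to a ceiling division. The helper below is a hypothetical sketch; `link_bus_transfers` and the example sizes are illustrative assumptions, and each lane is assumed to carry one bit per transfer.

```python
import math


def link_bus_transfers(packet_bits: int, lanes: int) -> int:
    """Number of link-bus transfers needed to move one packet.

    Assumes each lane carries one bit per transfer.
    """
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    return math.ceil(packet_bits / lanes)


# Illustrative numbers only: a 64-byte data payload over a 16-lane link.
transfers = link_bus_transfers(packet_bits=64 * 8, lanes=16)
```

Halving the lane count doubles the number of transfers, which is the trade-off between pin count and latency that the link bus configuration has to balance.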

BOB Simulation Suite

• Two separate simulators:
  • One developed by the authors
  • MARSSx86: a multi-core x86 simulator developed at SUNY-Binghamton
• The authors' simulator is cycle-based and written in C++.
  • It encapsulates the main BOB controller, each BOB channel, and the associated link bus and simple controller.
• Two modes:
  • Stand-alone: request parameterization; random addresses or a trace file are issued to the memory system.
  • Full system simulation: receives requests from MARSSx86.
• Memory:
  • A DDR3-1066 device (MT41J512M4-187E)
  • A DDR3-1333 device (MT41J1G4-15E)
  • A DDR3-1600 device (MT41J256M4-125E)

ref. [16]

BOB Simulation Result

• Two experiments:
  • A limit-case simulation: a random address stream is issued into a BOB memory system.
  • A full system simulation: an operating system is booted on an x86 processor and applications are executed.
• Benchmarks:
  • NAS parallel benchmarks
  • PARSEC benchmark suite [9]
  • STREAM
• Multi-threaded applications are emphasized to demonstrate the types of workloads this memory architecture is likely to encounter.
• Design tradeoffs: costs such as total pin count, power dissipation, and physical space (or total DIMM count).

Limit-Case Simulation

• The optimal rank depth for each DRAM channel is between 2 and 4.
• If the return queue is full, no further reads or writes are issued.
• A read return queue must have at least enough capacity for four response packets.

• Simple Controller & DRAM Efficiency
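The back-pressure behavior of the read return queue can be pictured with a small bounded queue. This is a hypothetical sketch (the class name `ReadReturnQueue` and its methods are not from the paper); the default capacity of four follows the minimum stated above.

```python
from collections import deque


class ReadReturnQueue:
    """Bounded queue of response packets waiting for the link bus."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.pending = deque()

    def is_full(self) -> bool:
        return len(self.pending) >= self.capacity

    def push(self, response) -> bool:
        """Store a response; return False (stall) when the queue is full."""
        if self.is_full():
            return False
        self.pending.append(response)
        return True

    def pop(self):
        """Send the oldest response back over the link bus."""
        return self.pending.popleft()


queue = ReadReturnQueue(capacity=4)
accepted = [queue.push(f"resp{i}") for i in range(5)]
```

In this sketch the fifth response is refused until `pop` frees a slot, which is the stall condition the slide describes.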

Limit-Case Simulation

• Bus width and speed are optimized so that the link buses do not stall the DRAM.
• A read-to-write request ratio of approximately 2-to-1.
• Equations 1 & 2: the bandwidth required by each link bus to prevent it from negatively impacting the efficiency of each channel.

• Link Bus Configuration (1/2)
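Equations 1 and 2 are not reproduced on the slide, so the sketch below is not those equations. It is a simplified stand-in under stated assumptions: the DRAM channel is kept fully busy, write data travels on the request link, read data travels on the response link, and packet header overhead is ignored. The function name and the 12 GB/s figure are placeholders.

```python
def link_bus_requirements(dram_bw: float, read_fraction: float) -> tuple:
    """Return (request_bw, response_bw) in the same unit as dram_bw.

    Assumptions: the DRAM channel is kept fully busy, write data travels on
    the request link, read data on the response link, and packet header
    overhead is ignored.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be in [0, 1]")
    response_bw = dram_bw * read_fraction
    request_bw = dram_bw * (1.0 - read_fraction)
    return request_bw, response_bw


# 2-to-1 read-to-write ratio, with a placeholder DRAM bandwidth of 12 GB/s.
request_bw, response_bw = link_bus_requirements(12.0, 2 / 3)
```

Under these assumptions a 2-to-1 mix needs roughly twice as much response bandwidth as request bandwidth, which is why an asymmetric link bus can be attractive.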

Limit-Case Simulation

• Weighting the response link bus more heavily than the request link bus may be ideal for some applications.

• Side effect: serializing the communication on unidirectional buses.

• Link Bus Configuration (2/2)

Limit-Case Simulation

• Multiple logically independent channels of DRAM share the same link bus and simple controller.

• Reduces costs such as pin-out, logic fabrication, and physical space.
• Reduces the number of simple controllers.

• Multi-Channel Optimization

Limit-Case Simulation

• 8 DRAM channels, each with 4 ranks (32 DIMMs making 256 GB total)

• CPU has up to 128 pins which can be used for data lanes

• These lanes are operated at 3.2 GHz (6.4 Gb/s)

• Cost Constrained Simulations
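The figures on this slide can be cross-checked with a little arithmetic. The constants below are taken from the bullets above; the derived values (capacity per DIMM, aggregate lane bandwidth) are simple consequences of them and are not quoted from the paper. The slide's count of 32 DIMMs implies one rank per DIMM.

```python
CHANNELS = 8
RANKS_PER_CHANNEL = 4
TOTAL_CAPACITY_GB = 256
DATA_LANES = 128
GBPS_PER_LANE = 6.4  # 3.2 GHz, two bits per clock per lane

# One rank per DIMM, as implied by the slide's "32 DIMMs".
dimms = CHANNELS * RANKS_PER_CHANNEL
gb_per_dimm = TOTAL_CAPACITY_GB / dimms

# Upper bound on raw link bandwidth if every data lane is in use.
aggregate_gbps = DATA_LANES * GBPS_PER_LANE
aggregate_gbytes_per_s = aggregate_gbps / 8
```

This works out to 8 GB per DIMM and about 102.4 GB/s of raw lane bandwidth to divide among request and response link buses.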

Full System Simulations

• STREAM and mcol generate the greatest average
• This is due to the request mix generated during the region of interest:
  • STREAM: 46% reads and 54% writes
  • mcol: 99% reads

• Performance & Power Trade-offs

Full System Simulations

• Address & Channel Mapping

Conclusion

• A new memory architecture that increases both speed and capacity.
• Intermediate logic between the CPU and DIMMs.
• Verified by implementing two configurations:
  • Limit-Case Simulation
  • Full System Simulation
• Queue depths, proper bus configurations, and address mappings are considered to achieve peak efficiency.
• Cost-constrained simulations are also performed.
• The buffer-on-board architecture: an ideal near-term solution.