Buffer-On-Board Memory System

Name: Aurangozeb
ISCA 2012

Outline

• Introduction
• Modern Memory System
• Buffer-On-Board (BOB) Memory System
• BOB Simulation Suite
• BOB Simulation Result
  • Limit-Case Simulation
  • Full System Simulation
• Conclusion

Introduction (1/2)

• The memory system must be modified to cope with higher speeds.
• Dual Inline Memory Module (DIMM): <100 MHz speed.
• Signal integrity issues (e.g. cross-talk, reflection) arise at high operating speeds.
  • Reducing the number of DIMMs allows a higher clock speed, but limits the total capacity.
• One simple solution: increase the capacity of a single DIMM.
  • Drawback: it is difficult to decrease the DRAM capacitor size, and cost does not scale linearly.

Introduction (2/2)

• FB-DIMM memory solution:
  • Advanced Memory Buffer (AMB) with DDRx DRAM to interpret the packetized protocol and issue DRAM-specific commands.
  • Supports fast and slow speeds of operation.
  • Drawback: the high-speed I/O of the AMB causes heat and power issues and is not cost-effective.
• Solution from IBM / Intel / AMD:
  • A single logic chip, rather than one logic chip per FB-DIMM.
  • Controls the DRAM and communicates with the CPU over a relatively faster and narrower bus.
  • New architecture using low-cost DIMMs.

Modern Memory System

• Considerations:
  • Ranks of memory per channel
  • DRAM type
  • Number of channels per processor

Buffer-On-Board (BOB) Memory System (1/2)

• Multiple BOB channels
• Each channel consists of LR-, R-, or U-DIMMs
• A single, simple controller for each channel
• Faster and narrower bus (link bus) between the simple controller and the CPU

Buffer-On-Board (BOB) Memory System (2/2)

• Operation:
  • Request packet over the link bus: address + request type + data (if write)
  • The request is translated into DRAM-specific commands (ACTIVATE, READ, WRITE, etc.) and issued to the DRAM ranks.
  • A command queue: dynamic scheduling
  • Read return queue: sorting after the data is received
  • Response packet contains: data + address of the initial request.
• BOB controller:
  • Address mapping
  • Returning data to the CPU/cache
  • Packetizing requests
  • Interpreting response packets: from and to the simple controller
  • Encapsulation: to support the narrower link bus
    • Uses multiple clock cycles to transmit the total data.
  • A cross-bar switch: any port to any link bus.
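
As a rough illustration of the encapsulation step above, the sketch below counts how many link-bus transfers one packet needs when the bus is narrower than the packet. The 64-bit header (address + request type), the 64-byte data payload, and the lane counts are assumptions chosen for illustration, not values taken from the slides.

```python
import math

# Assumed packet layout (hypothetical, for illustration only).
HEADER_BITS = 64        # address + request type
DATA_BITS = 64 * 8      # one 64-byte cache line of write data


def link_transfers(packet_bits: int, lanes: int) -> int:
    """Transfers needed to send a packet when each lane carries one bit per transfer."""
    if packet_bits < 0 or lanes <= 0:
        raise ValueError("packet_bits must be >= 0 and lanes must be > 0")
    return math.ceil(packet_bits / lanes)


if __name__ == "__main__":
    read_request = HEADER_BITS                # a read request carries no data
    write_request = HEADER_BITS + DATA_BITS   # a write request carries its data
    for lanes in (8, 16, 32):
        print(lanes,
              link_transfers(read_request, lanes),
              link_transfers(write_request, lanes))
```

Under these assumptions, halving the lane count doubles the number of transfers per packet, which is the trade-off a narrower link bus makes against its higher clock rate.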


BOB Simulation Suite

• Two separate simulators:
  • A BOB simulator developed by the authors
  • MARSSx86: a multi-core x86 simulator developed at SUNY-Binghamton
• Cycle-based simulator written in C++
  • Encapsulates the main BOB controller, each BOB channel, and the associated link buses and simple controllers.
• Two modes:
  • Stand-alone: parameterized requests, using random addresses or a trace file, are issued to the memory system.
  • Full system simulation: requests are received from MARSSx86.
• Memory devices (ref. [16]):
  • A DDR3-1066 device (MT41J512M4-187E)
  • A DDR3-1333 device (MT41J1G4-15E)
  • A DDR3-1600 device (MT41J256M4-125E)
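
For orientation, the peak data bandwidth of one DRAM channel at each of these speed grades can be estimated as transfer rate times bus width. The sketch below assumes a standard 64-bit (8-byte) data bus per channel; that width is an assumption, since the slides do not state it.

```python
BUS_BYTES = 8  # assumed 64-bit data bus per DRAM channel


def peak_bandwidth_gbs(mega_transfers_per_s: float, bus_bytes: int = BUS_BYTES) -> float:
    """Peak channel bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


if __name__ == "__main__":
    # Nominal data rates in mega-transfers per second for each speed grade.
    grades = (("DDR3-1066", 1066.67), ("DDR3-1333", 1333.33), ("DDR3-1600", 1600.0))
    for name, rate in grades:
        print(f"{name}: {peak_bandwidth_gbs(rate):.2f} GB/s")
```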


BOB Simulation Result

• Two experiments:
  • A limit-case simulation: a random address stream is issued into a BOB memory system.
  • A full system simulation: an operating system is booted on an x86 processor and applications are executed.
• Benchmarks:
  • NAS parallel benchmarks
  • PARSEC benchmark suite [9]
  • STREAM
• Emphasized multi-threaded applications to demonstrate the types of workloads this memory architecture is likely to encounter.
• Design trade-offs: costs such as total pin count, power dissipation, and physical space (or total DIMM count).
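
A limit-case input of the kind described above could be produced by a generator like the one below. This is only a sketch, not the authors' simulator: the `Request` type, the `random_requests` function, the 38-bit address space, the 64-byte line size, and the default read fraction are all hypothetical choices made for illustration.

```python
import random
from typing import Iterator, NamedTuple


class Request(NamedTuple):
    """Hypothetical request record: a read or a write to a line-aligned address."""
    is_write: bool
    address: int


def random_requests(count: int,
                    read_fraction: float = 2 / 3,
                    address_bits: int = 38,
                    line_bytes: int = 64,
                    seed: int = 0) -> Iterator[Request]:
    """Yield `count` requests with uniformly random, line-aligned addresses."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be within [0, 1]")
    rng = random.Random(seed)                     # seeded for repeatable runs
    num_lines = (1 << address_bits) // line_bytes
    for _ in range(count):
        address = rng.randrange(num_lines) * line_bytes
        is_write = rng.random() >= read_fraction  # reads with probability read_fraction
        yield Request(is_write, address)


if __name__ == "__main__":
    for request in random_requests(5):
        kind = "WRITE" if request.is_write else "READ"
        print(f"{kind} 0x{request.address:010x}")
```

Seeding the generator keeps a run repeatable, so different bus or queue configurations can be compared against the same request stream.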


Limit-Case Simulation

• Simple Controller & DRAM Efficiency
  • The optimal rank depth for each DRAM channel is between 2 and 4.
  • If the read return queue is full, no further reads or writes can be issued.
  • A read return queue must have enough capacity for at least four response packets.
  • Bus width and speed optimization: the link buses must not stall the DRAM.
  • A read-to-write request ratio of approximately 2-to-1
  • Equations 1 & 2: the bandwidth required by each link bus to prevent it from negatively impacting the efficiency of each channel.
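
The slide does not reproduce Equations 1 & 2, so the sketch below is not those equations. It is a simplified estimate of the same quantity under stated assumptions: every request moves one 64-byte line on the DRAM data bus, every packet carries an 8-byte header, read requests carry no data, and writes produce no response packet. The function name and all of these parameters are hypothetical.

```python
def link_bandwidth_needed(dram_gbs: float,
                          read_fraction: float,
                          line_bytes: int = 64,
                          header_bytes: int = 8) -> tuple[float, float]:
    """Return (request_gbs, response_gbs) that keep pace with a DRAM channel
    moving dram_gbs of data at the given read fraction."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be within [0, 1]")
    # Average bytes on each link per line transferred on the DRAM data bus.
    request_bytes = header_bytes + (1.0 - read_fraction) * line_bytes
    response_bytes = read_fraction * (line_bytes + header_bytes)
    lines_per_ns = dram_gbs / line_bytes  # GB/s divided by bytes per line
    return request_bytes * lines_per_ns, response_bytes * lines_per_ns


if __name__ == "__main__":
    # Assumed 12.8 GB/s channel with a 2-to-1 read-to-write ratio.
    request_gbs, response_gbs = link_bandwidth_needed(12.8, 2 / 3)
    print(f"request link: {request_gbs:.2f} GB/s, response link: {response_gbs:.2f} GB/s")
```

With a read-heavy mix the response link needs more bandwidth than the request link in this model, which is one way to see why the two links may be sized differently.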


• Link Bus Configuration (1/2)



• Weighting the response link bus more heavily than the request link bus may be ideal for some applications.

• Side-effect: Serializing the communication on unidirectional buses




• Multi-Channel Optimization
  • Multiple logically independent DRAM channels share the same link bus and simple controller.
  • Reduces costs such as pin-out, logic fabrication, and physical space.
  • Reduces the number of simple controllers.


• Cost-Constrained Simulations
  • 8 DRAM channels, each with 4 ranks (32 DIMMs making 256 GB total)
  • The CPU has up to 128 pins that can be used for data lanes.
  • These lanes are operated at 3.2 GHz (6.4 Gb/s).
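
The figures on this slide can be checked with a little arithmetic. The sketch below assumes one DIMM per rank (consistent with 8 x 4 = 32 DIMMs), that every pin acts as one data lane, and that lanes are split evenly across channels; the last two are assumptions, and the function name is hypothetical.

```python
def cost_constrained_summary(channels: int = 8,
                             ranks_per_channel: int = 4,
                             total_gb: int = 256,
                             data_pins: int = 128,
                             gbit_per_lane: float = 6.4) -> dict:
    """Derive per-DIMM capacity and link bandwidth from the slide's figures."""
    dimms = channels * ranks_per_channel           # assumes one DIMM per rank
    aggregate_gbs = data_pins * gbit_per_lane / 8  # Gb/s to GB/s, one lane per pin
    return {
        "dimms": dimms,
        "gb_per_dimm": total_gb / dimms,
        "aggregate_link_gbs": aggregate_gbs,
        "lanes_per_channel": data_pins // channels,  # assumes an even split
        "link_gbs_per_channel": aggregate_gbs / channels,
    }


if __name__ == "__main__":
    for key, value in cost_constrained_summary().items():
        print(f"{key}: {value}")
```

Under an even split, each channel's share of lanes must cover both the request and the response direction, so the per-channel figure is an upper bound on the two links combined.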


Full System Simulations



• Performance & Power Trade-offs
  • STREAM and mcol generate the greatest average
  • This is due to the request mix generated during the region of interest:
    • STREAM: 46% reads and 54% writes
    • mcol: 99% reads


• Performance & Power Trade-offs



• Address & Channel Mapping


Conclusion

• A new memory architecture that increases both speed and capacity.
• Intermediate logic between the CPU and DIMMs.
• Verified by implementing two configurations:
  • Limit-Case Simulation
  • Full System Simulation
• Queue depths, proper bus configurations, and address mappings are considered to achieve peak efficiency.
• Cost-constrained simulations are also performed.
• The buffer-on-board architecture: an ideal near-term solution.
