Lecture 3: Memory Buffers and Scheduling
• Topics: buffers (FB-DIMM, RDIMM, LRDIMM, BoB, BOOM),
memory blades, scheduling policies
HMC Details
• 32 banks per die x 8 dies = 256 banks per package
• 2 banks x 8 dies form 1 vertical slice (shared data bus)
• High internal data bandwidth via TSVs: an entire cache line is
fetched from a single array (2 banks) that is 256 bytes wide
• Future generations: eight links that
can connect to the processor or
other HMCs – each link (40 GBps)
has 16 up and 16 down lanes
(each lane has 2 differential wires)
• 1866 TSVs at 60 um pitch and
2 Gb/s (50 nm 1Gb DRAMs)
• 3.7 pJ/bit for DRAM layers and 6.78 pJ/bit for logic layer
(existing DDR3 modules are 65 pJ/bit)
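• Link bandwidth sanity check (assuming 10 Gb/s per lane, consistent
with published HMC specifications): 16 lanes x 10 Gb/s = 20 GB/s in
each direction, i.e., 40 GBps per link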
Figure from: T. Pawlowski, HotChips 2011
Modern Memory System
[Figure: processor (PROC) with four DDR3 channels, each populated with DIMMs.]
• 4 DDR3 channels
• 64-bit data channels
• 800 MHz channels
• 1-2 DIMMs/channel
• 1-4 ranks/channel
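• Peak bandwidth check (assuming the 800 MHz channels are DDR, i.e.,
two transfers per cycle): 1600 MT/s x 8 bytes = 12.8 GB/s per channel,
51.2 GB/s across the four channels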
Cutting-Edge Systems
[Figure: processor (PROC) connected by a narrow serial link to a Scalable Memory Buffer (SMB), which fans out to multiple DDR3 channels.]
• The link into the processor is narrow and high frequency
• The Scalable Memory Buffer chip is a “router” that connects
to multiple DDR3 channels (wide and slow)
• Boosts processor pin bandwidth and memory capacity
• More expensive, high power
Buffer-on-Board Examples
From: Cooper-Balis et al., ISCA 2012
FB-DIMM
• A fully-populated FB-DIMM channel consumes roughly 100 W under high load.
[Figure: FB-DIMM architecture. The processor drives a narrow point-to-point serial link (10-bit southbound, 14-bit northbound); each FB-DIMM carries an Advanced Memory Buffer (AMB) that connects to the DRAM chips on that DIMM, and up to 8 FB-DIMMs can be daisy-chained through the AMBs.]
LR-DIMM
Figure from: Yoon et al., ISCA 2012
BOOM (Yoon et al., ISCA 2012)
• Using LPDDR chips
at low frequency
• Wide internal datapath
• Longer burst length
on DBUS
• No cache line buffering
• Higher activate power
• Same bandwidth, but
lower power and higher
latency
BOOM with Sub-Ranking
Disaggregated Memory (Lim et al., ISCA 2009)
• To support high capacity memory, build a separate memory
blade that is shared by all compute blades in a rack
• For example, if average utilization is 2 GB, each compute
blade is only provisioned with 2 GB of memory; but the
compute blade can also access (say) 2 TB of data in the
memory blade
• The hierarchy is exclusive and data is managed at page
granularity
• Remote memory access is via PCIe (120 ns latency and
1 GB/s in each direction)
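• Back-of-the-envelope (assuming 4 KB pages): a remote page fetch costs
120 ns of link latency plus 4 KB / 1 GB/s ≈ 4 us of transfer time, so
remote memory is far slower than local DRAM but much faster than disk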
Scheduling Policies: Basics
• Must honor several timing constraints for each bank/rank
• Commands: PRE, ACT, COL-RD, COL-WR, REF, Power-Up/Down
• Must handle reads and writes on the same DDR3 bus
• Must issue refreshes on time
• Must maximize row buffer hit rates and parallelism
• Must maximize throughput and fairness
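• Example: a read to a bank whose row buffer is closed must issue ACT,
wait tRCD, then issue COL-RD and wait CL before data appears; a read
to an already-open row can issue COL-RD immediately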
Address Mapping Policies
• Consecutive cache lines can be placed in the same row
to boost row buffer hit rates
• Consecutive cache lines can be placed in different ranks
to boost parallelism
• Example address mapping policies:
row:rank:bank:channel:column:blkoffset
row:column:rank:bank:channel:blkoffset
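A minimal sketch of the first policy above, with illustrative field
widths (64 B lines, 4 channels, 8 banks, 2 ranks); all widths are
assumptions, not values from the lecture:

```python
# Hypothetical field widths for row:rank:bank:channel:column:blkoffset,
# listed least-significant field first (widths are assumptions).
FIELDS = [("blkoffset", 6),   # 64 B cache lines
          ("column", 7),
          ("channel", 2),     # 4 channels
          ("bank", 3),        # 8 banks per rank
          ("rank", 1),        # 2 ranks
          ("row", 15)]

def decode(addr: int) -> dict:
    """Split a physical address into DRAM coordinates."""
    coords = {}
    for name, width in FIELDS:
        coords[name] = addr & ((1 << width) - 1)
        addr >>= width
    return coords

# Consecutive cache lines differ only in low-order column bits, so they
# map to the same row of the same bank and hit in the row buffer.
assert decode(0x40)["column"] == decode(0x80)["column"] - 1
```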
Reads and Writes
• A single bus is used for reads and writes
• The bus direction must be reversed when switching between
reads and writes; this takes time and leads to bus idling
• Hence, writes are performed in bursts; a write buffer stores
pending writes until a high water mark is reached
• Writes are drained until a low water mark is reached
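A minimal sketch of this drain policy; the water-mark thresholds are
hypothetical (the slides do not give values):

```python
# Write-drain with high/low water marks (thresholds are assumptions).
HIGH_WATER, LOW_WATER = 48, 16

class Channel:
    def __init__(self):
        self.write_buffer = []    # pending writes
        self.draining = False     # True while the bus serves writes

    def next_request(self, read_queue: list):
        if len(self.write_buffer) >= HIGH_WATER:
            self.draining = True            # start a write burst
        elif len(self.write_buffer) <= LOW_WATER:
            self.draining = False           # resume serving reads
        if self.draining and self.write_buffer:
            return self.write_buffer.pop(0)
        return read_queue.pop(0) if read_queue else None
```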
Refresh
• Example: tREFI (gap between refresh commands) = 7.8 us,
tRFC (time to complete a refresh command) = 350 ns
• JEDEC: must issue 8 refresh commands in a 8*tREFI
time window
• Allows for some flexibility in when each refresh command
is issued
• Elastic refresh: issue a refresh command when there is a
lull in activity; any unissued refreshes are handled at the
end of the 8*tREFI window
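A rough sketch of elastic-refresh bookkeeping under the JEDEC window
rule; the forced-refresh condition is a simplifying assumption:

```python
# Timing in ns; a real controller also spaces refreshes by tRFC.
tREFI = 7800    # average gap between refresh commands
tRFC = 350      # time for one refresh command to complete

class RefreshState:
    def __init__(self):
        self.owed = 8                  # refreshes owed this window
        self.window_end = 8 * tREFI    # JEDEC 8*tREFI deadline

    def maybe_refresh(self, now: float, bus_idle: bool):
        if now >= self.window_end:     # roll over to the next window
            self.owed += 8
            self.window_end += 8 * tREFI
        # Forced: just enough time remains to issue the owed refreshes.
        forced = (self.window_end - now) <= self.owed * tRFC
        if self.owed > 0 and (bus_idle or forced):
            self.owed -= 1             # elastic: piggyback on a lull
            return "REF"
        return None
```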
Maximizing Row Buffer Hit Rates
• FCFS: Issue the first read or write in the queue that is
ready for issue (not necessarily the oldest in program order)
• First-Ready FCFS (FR-FCFS): among ready requests, issue row buffer
hits first; otherwise fall back to oldest-first
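A minimal FR-FCFS selection sketch; the Req fields are assumed for
illustration:

```python
from dataclasses import dataclass

@dataclass
class Req:
    bank: int
    row: int
    ready: bool   # all timing constraints satisfied

def fr_fcfs(queue: list, open_rows: dict):
    """queue is oldest-first; open_rows maps bank -> currently open row.
    Return the oldest row buffer hit, else the oldest ready request."""
    ready = [r for r in queue if r.ready]
    hits = [r for r in ready if open_rows.get(r.bank) == r.row]
    return hits[0] if hits else (ready[0] if ready else None)
```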
STFM (Mutlu and Moscibroda, MICRO'07)
• When multiple threads run together, threads with row
buffer hits are prioritized by FR-FCFS
• Each thread has a slowdown: S = Tshared / Talone, where T is
the number of cycles the ROB is stalled waiting for memory
• Unfairness is estimated as Smax / Smin
• If unfairness is higher than a threshold, thread priorities
override other priorities (Stall Time Fair Memory scheduling)
• Estimation of Talone requires some book-keeping: does an
access delay critical requests from other threads?
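A sketch of the unfairness check; the threshold and the stall-time
estimates are assumed inputs (real hardware must estimate Talone
on-line with the book-keeping described above):

```python
def stfm_pick_priority(threads: dict, threshold: float = 1.10):
    """threads maps tid -> (stall_shared, stall_alone_estimate).
    Return the tid to prioritize, or None to fall back to FR-FCFS."""
    slowdown = {tid: shared / alone
                for tid, (shared, alone) in threads.items()}
    unfairness = max(slowdown.values()) / min(slowdown.values())
    if unfairness > threshold:
        # Boost the most-slowed-down thread over row-hit/age ordering.
        return max(slowdown, key=slowdown.get)
    return None
```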
PAR-BS (Mutlu and Moscibroda, ISCA'08)
• A batch of requests (per bank) is formed: each thread can
only contribute R requests to this batch; batch requests
have priority over non-batch requests
• Within a batch, priority is first given to row buffer hits, then
to threads with a higher “rank”, then to older requests
• Rank is computed based on the thread’s memory intensity;
low-intensity threads are given higher priority; this policy
improves batch completion time and overall throughput
• By using rank, requests from a thread are serviced in
parallel; hence, parallelism-aware batch scheduling
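A sketch of batch formation and within-batch priority; requests are
assumed to carry tid/bank/row/ready fields, and MARK_CAP stands in for
the per-thread cap R:

```python
from collections import Counter

MARK_CAP = 5   # "R": per-thread cap on requests marked into a batch

def form_batch(queue):
    """Mark up to MARK_CAP oldest requests from each thread."""
    marked = Counter()
    for req in queue:                    # queue is oldest-first
        req.in_batch = marked[req.tid] < MARK_CAP
        marked[req.tid] += req.in_batch

def pick(queue, open_rows, rank):
    """Priority: batched > row hit > higher thread rank > older."""
    ready = [(i, r) for i, r in enumerate(queue) if r.ready]
    if not ready:
        return None
    def key(item):
        i, r = item
        hit = open_rows.get(r.bank) == r.row
        return (r.in_batch, hit, rank[r.tid], -i)
    return max(ready, key=key)[1]
```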
TCM (Kim et al., MICRO 2010)
• Organize threads into latency-sensitive and bw-sensitive
clusters based on memory intensity; former gets higher
priority
• Within bw-sensitive cluster, priority is based on rank
• Rank is determined based on “niceness” of a thread and
the rank is periodically shuffled with insertion shuffling or
random shuffling (the former is used if there is a big gap in
niceness)
• Threads with low row buffer hit rates and high bank level
parallelism are considered “nice” to others
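A sketch of the two-cluster split and niceness ordering; the 10%
bandwidth threshold and the blp - rbl niceness proxy are simplifying
assumptions standing in for TCM's ClusterThreshold and rank-based
niceness:

```python
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    mpki: float      # memory intensity (misses per kilo-instruction)
    bw: float        # bandwidth consumed in the last quantum
    rbl: float       # row buffer locality (hit rate)
    blp: float       # bank-level parallelism

def tcm_clusters(threads, cluster_thresh=0.10):
    """Lowest-intensity threads join the latency-sensitive cluster
    until they account for cluster_thresh of total bandwidth."""
    total_bw = sum(t.bw for t in threads)
    latency_cluster, used = [], 0.0
    for t in sorted(threads, key=lambda t: t.mpki):
        if used + t.bw > cluster_thresh * total_bw:
            break
        latency_cluster.append(t)
        used += t.bw
    bw_cluster = [t for t in threads if t not in latency_cluster]
    # "Nicer" threads (high bank-level parallelism, low row buffer
    # locality) get higher rank within the bandwidth cluster.
    bw_cluster.sort(key=lambda t: t.blp - t.rbl, reverse=True)
    return latency_cluster, bw_cluster
```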
Minimalist Open-Page (Kaseridis et al., MICRO 2011)
• Place 4 consecutive cache lines in one bank, then the next
4 in a different bank and so on – provides the best balance
between row buffer locality and bank-level parallelism
• Don’t have to worry as much about fairness
• Scheduling first takes priority into account, where priority
is determined by wait-time, prefetch distance, and MLP in
thread
• A row is precharged after 50 ns, or immediately following
a prefetch-dictated large burst
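• Example mapping (an assumption for 64 B lines): bits [5:0] are the
block offset, bits [7:6] select one of the 4 consecutive lines in a
row, and the bank index is drawn from the bits just above, so a
streaming access gets 4 row buffer hits in one bank before hopping to
the next bank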
Other Scheduling Ideas
• Using reinforcement learning (Ipek et al., ISCA 2008)
• Coordinating across multiple memory controllers (Kim et al., HPCA 2010)
• Coordinating requests from the GPU and CPU (Ausavarungnirun et al., ISCA 2012)
• Several schedulers in the Memory Scheduling Championship at ISCA 2012
• Predicting the number of row buffer hits (Awasthi et al., PACT 2011)