
Memory Access Pattern-Aware DRAM Performance Model for Multi-Core Systems

Hyojin Choi∗, Jongbok Lee† and Wonyong Sung∗

∗School of Electrical Engineering, Seoul National University
599 Gwanak-ro, Gwanak-gu, Seoul, 151-744 Korea
{hjchoi@dsp., wysung@}snu.ac.kr

†Dept. of Information & Communications Engineering, Hansung University
389 Samseon-dong 2-ga, Seongbuk-gu, Seoul, 136-792 Korea
[email protected]

Abstract—DRAM latency modeling is complex because most chips contain row-buffers and multiple banks to exploit patterns of DRAM accesses. As a result, the latency of a DRAM access depends not only on the circuit timing parameters but also on the memory access pattern. This study derives an analytical model that predicts the DRAM access performance using DRAM timing and memory access pattern parameters. As a performance metric, the bank busy time of DRAM is used. The pattern parameters employed represent memory access characteristics such as the number of row-buffer misses, the number of read or write requests that hit the row-buffers, etc. The proposed model not only relates the DRAM access performance with the memory access pattern but also provides information for timing optimization of next-generation DRAMs. The model is evaluated with the SPLASH-2 benchmarks by using cycle-accurate timing simulations with DDR3 timings. The evaluation results show that, in memory-bounded cases, the execution time is limited by bank utilization, not by the data bus occupation ratio.

I. INTRODUCTION

The speed gap between microprocessors and main memory systems, known as the processor-memory gap problem, has been a bottleneck for performance enhancement of computer systems [1][2]. The main memory access performance has been improved by increasing the DRAM clock frequency and also by enhancing the DRAM architecture. For instance, the clock frequency was 100 MHz about 10 years ago, and now its maximum is 1600 MHz for DDR3 [3]. Advanced DRAM architectures such as XDR and FB-DIMM have been developed for video game consoles and server systems [4][5]. As multi-core CPU based systems with only one main memory become popular, the main memory becomes increasingly important for providing sustainable performance enhancement of computer systems.

Since most current DRAM chips contain row-buffers and multiple banks, the DRAM access performance is highly sensitive to a program's memory access behavior. An over-simplified analytical DRAM performance model may produce wrong design decisions. For example, the peak throughput, where it is assumed that all memory requests are read or write ones and always hit the row-buffers, is in direct proportion to the DRAM clock frequency. One may increase the DRAM clock frequency based on this simple model, but fail to achieve the predicted throughput. We do not have enough understanding of which features of memory access behavior are related with DRAM timing parameters. Moreover, in multi-core systems, memory access streams from simultaneously working threads are intermingled. As a result, it is very complicated not only to characterize memory access behavior, but also to predict DRAM access performance precisely.

To address this issue, we propose an analytical model that captures the characteristics of main memory access streams and relates them with DRAM access performance. The performance metric of this model is the bank busy time, which is defined as the length of the time spent by a bank to serve the memory requests issued to that bank. The bank busy time is expressed as a linear combination of DRAM timing and pattern parameters. The pattern parameters employed represent the memory access characteristics such as the number of row-buffer misses, the number of read and write requests that hit the row-buffer, and so on.

The pattern-awareness is the key feature that distinguishes the proposed model from the previous ones. Ahn et al.'s first-order achievable bandwidth model [6] assumes that all memory references are row-buffer misses. Yuan et al.'s hybrid analytical model [7] predicts DRAM efficiency by profiling memory traces, and it does not provide a mathematical relationship between memory access characteristics and the performance metric. On the other hand, the proposed model enables analysis of how much each DRAM timing parameter affects the performance, as well as which features of memory access behavior are related with it. Also, the proposed model provides a simple and intuitive way to explain DRAM access performance in multi-core systems, while well-known performance metrics such as the memory access latency and the throughput, which would involve queuing theory as shown in [8], do not show the relation very clearly.


[Fig. 1 omitted: a DIMM connected to the memory controller through the addr/cmd bus and the data bus; each DRAM chip contains multiple banks (each with a cell array and a row buffer), address/control logic, and bus drivers.]

Fig. 1. Main memory systems and DRAM architecture

The rest of this paper is organized as follows. Section II briefly explains DRAM architecture and introduces our modeling approach. Section III details the bank busy time model. Section IV presents evaluation results. The limitation of this work is discussed in Section V. Section VI concludes this work.

II. MODELING OVERVIEW

A. DRAM Architecture

In this work, the DDR3 DRAM architecture is considered for modeling; however, the results can be easily extended to other DRAM generations. A DRAM system usually contains multiple banks, typically 4 or 8, to increase the throughput by servicing successive memory requests in parallel. As shown in Figure 1, each DRAM bank consists of a two-dimensional structure with rows and columns, called a cell array, and it has one row-buffer where the entire row data delivered from its cell array is stored to serve as a line cache. Due to the cache-like behavior of the row-buffer, a DRAM access can have different latency according to the previous status of the row-buffer. When the access is made to the same row address as the buffered one, it takes a short time because the data is simply read from the buffer. Otherwise, the entire row data should be read from the cell array (activate operation). The worst case happens when the buffer stores a different row from the one to access, and the data in the row-buffer should be written back to the cell array (precharge operation). The first case is called the row-buffer hit, and the second and the last ones are called the row-buffer miss. More specifically, the last one is often called the row-buffer conflict.
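To make the row-buffer behavior concrete, the following minimal Python sketch (our illustration; the class and names are not from the paper) tracks a single bank's open row and classifies each access as a hit, a miss to a closed bank, or a conflict, mirroring the three cases above.

class Bank:
    """One DRAM bank with a single row-buffer (open-page behavior)."""
    def __init__(self):
        self.open_row = None  # row currently held in the row-buffer, if any

    def access(self, row):
        """Classify an access to `row` and update the row-buffer state."""
        if self.open_row == row:
            return "hit"        # column command only
        if self.open_row is None:
            self.open_row = row
            return "miss"       # activate, then read/write
        self.open_row = row
        return "conflict"       # precharge, activate, then read/write

# With row A already open, repeated accesses to A hit; switching rows conflicts.
bank = Bank()
bank.open_row = "A"
print([bank.access(r) for r in ["A", "A", "B", "B"]])  # ['hit', 'hit', 'conflict', 'hit']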

Figure 2-(a) shows a timing diagram example of the DRAM operation. It is assumed that banks 0 and 1 have rows A and X in their row-buffers, respectively, before the first memory request arrives. The memory requests are denoted as Q1 ∼ Q6, and their types (read or write) and addresses are summarized in Fig. 2-(b). When a memory request arrives, the memory controller issues DRAM commands such as precharge, activate, and read or write. For example, when the request Q1 arrives, the read command is issued to bank 0 because its row-buffer has the row A as assumed. The request Q5 corresponds to a row-buffer miss because bank 0 has the row A open, so the precharge, activate, and read commands are issued to the bank.

B. Approach

The number of DRAM commands needed to handle a memory request varies according to row-buffer hit or miss. The memory controller issues one column command (read or write) if the memory request hits the row-buffer. In contrast, three DRAM commands, precharge-activate-read or precharge-activate-write, are issued to that bank for the row-buffer miss case. Note that the open policy is assumed for row-buffer management. The row-buffer hit or miss is determined by the address of the incoming request and the status of the bank, which depends on the memory requests serviced previously. This makes DRAM access performance dependent on the memory access pattern. Thus, an analytical DRAM performance model should consider the memory access pattern to predict DRAM access performance precisely.

Typically, the memory access pattern represents how the memory address space is accessed by processor cores [9][10][11], for example, sequential, random, and strided patterns. However, the memory access pattern that arrives at the main memory system is much different from the one generated by processor cores. Especially in multi-core systems, memory accesses from a processor core are filtered by a private cache, mixed with accesses from other cores, and then filtered again by a shared cache. Also, the order of memory accesses can be changed by the memory controller's scheduling capability. As a result, it is extremely complex to find out the relationship between the processor's memory access characteristics and the DRAM access performance.

To simplify the problem, this paper focuses on DRAM commands that are generated by the memory controller. At first, the DRAM commands issued to a bank are analyzed and related to DRAM access performance. Then, the characteristics of memory requests, which are named pattern parameters, are derived from those of DRAM commands considering the memory controller's assumed schemes. For example, the number of read requests is the same as the number of read commands issued to DRAM. The number of row-buffer misses is the same as the number of precharge or activate commands if no refresh command is issued. This approach characterizes the memory access streams that have been affected by various architectural features such as the number of processor cores, the capacity of cache memories, and the memory controller's address mapping scheme and memory access scheduling algorithms.


[Fig. 2(a) timing diagram omitted: DRAM clock, cmd/addr bus, bank 0 and bank 1 busy periods, and data bus activity for requests Q1–Q6, annotated with tCCD, tRP, tRCD, tRTP, tCWL+tBL+tWR, tRRD, and tCAS. Pxy/Axy denote precharge/activate (bank x, row y); Rxz/Wxz denote read/write (bank x, column z); xzi denotes a data transfer (bank x, column z, index i); Dx denotes a time delay.]

(b) memory access requests

            Q1    Q2    Q3     Q4    Q5    Q6
  type      read  read  write  read  read  read
  bank      0     0     1      0     0     1
  row       A     A     Y      A     B     Z
  col       a     b     a      c     a     a
  hit/miss  hit   hit   miss   hit   miss  miss

(c) time delay due to inter-bank interference

  classification  description
  D1              addr/cmd bus contention when P(1,X) is issued
  D2              addr/cmd bus contention when R(0,c) is issued
  D3              data bus contention between (1,a,4)(1,a,6) and (0,c,0)(0,c,2)
  D4              inter-bank timing constraint between A(0,B) and A(1,Z)

Fig. 2. Timing diagram example

[Fig. 3 omitted: DRAM clock, command, and bank-busy traces for four cases: (a) precharge-to-activate (tRP), (b) read-to-read (tCCD), (c) read-to-precharge (tRTP), and (d) read-to-precharge constrained by tRAS (ACT, RD, PRE with tRCD, tRTP, and tRAS marked).]

Fig. 3. Examples of minimum inter-command delays

C. Bank Busy Time

The performance metric that we consider is the bank busy time. A bank is said to be busy if it is not possible for the memory controller to issue any command to the bank due to the timing constraints between the previously issued command and the current one. The bank busy time is defined as the length of the time spent by a bank to service the memory requests issued to that bank during program execution. When a memory request hits the row-buffer, the bank busy time is short because the corresponding bank only handles that request. On the other hand, when a memory request misses the row-buffer, the bank should conduct precharge, activate, and read (or write) operations; as a result, the bank busy time becomes long.

As the first step to model the bank busy time, the minimum timing delay of each DRAM command is needed. However, it is not so straightforward to model the timing because the minimum time depends on the command that follows. As shown in Fig. 3-(b), if two read commands are issued to a bank, the minimum timing of the first one is tCCD. However, if a precharge command follows the read command, the minimum timing of the read command is tRTP. Note that the precharge command can be issued after tRTP from the instant when the read command is issued, as described in the DRAM data sheet [12]. This implies that the minimum timing should be defined for a pair of DRAM commands.

Moreover, the minimum timing becomes more complicated when there are multiple timing constraints on a command. The precharge command in Fig. 3-(d) cannot be issued after tRTP from the time instant when the read command is issued due to the timing constraint, tRAS, between the activate and precharge commands. In this case, the minimum timing of the read command is tRAS − tRCD, as marked in the figure. In Section III, the possible DRAM command pairs are listed and the minimum time delays for them are determined. Then, the bank busy time is represented as a linear combination of the minimum time delays of DRAM command pairs and the number of their occurrences during program execution.
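As a numeric check of the Fig. 3-(d) case (our arithmetic, using the DDR3-800 values from Table IV):

tRCD + tRTP = 12.5 + 10.0 = 22.5 nsec < tRAS = 37.5 nsec,

so the precharge is gated by tRAS rather than by tRTP, and the effective minimum timing of the read command becomes tRAS − tRCD = 37.5 − 12.5 = 25.0 nsec instead of tRTP = 10.0 nsec.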

The bank busy time is similar to the memory access latency, but not exactly the same. Both explain the time delay for a memory request to be served by a bank. However, the bank busy time does not include the column access latencies, which are denoted by tCAS and tCWL for read and write commands, respectively, because it concerns only whether the bank is available or not. Also, the bank busy time does not include the queuing delay in the memory controller's buffer, which becomes a dominant part of the memory access latency for memory-intensive cases. But for computation-intensive cases, where there are only a few pending requests in the buffer, the bank busy time becomes a dominant part of the memory access latency.

D. Bank Idle Time

In contrast to the bank busy time, the bank idle time is defined as the time when any DRAM command can be issued to the bank. The bank idle time can be classified into two cases. The first one is the time when there is no DRAM command to be issued, called free-time in this paper. It happens when there is no pending memory request for that bank in the memory controller's buffer.

The second case is the time delay due to interference from other banks. The cause of inter-bank interference falls into one of the following categories:

1) addr/cmd bus contention
2) data bus contention
3) inter-bank timing constraints (tRRD and tFAW)

For example, time delays due to inter-bank interference are denoted as D1 ∼ D4 in Fig. 2-(a) and (c). The precharge command for the request Q3 cannot be issued to bank 1 because the addr/cmd bus is already occupied by the read command for bank 0 (D1). The read command to bank 0 for the request Q4 is delayed because the write command to bank 1 occupies the addr/cmd bus (D2), and it is delayed further until the data bus is available (D3). The activate command for bank 1 cannot be issued due to the timing constraint tRRD (D4).

III. MODELING DETAILS

As discussed in the previous section, the minimum time for executing a DRAM command depends on the following one. In this section, the minimum inter-command delays are defined for the possible DRAM command pairs. The bank busy time is modeled as a weighted sum of the minimum inter-command delays, where the weights are the numbers of occurrences of each DRAM command pair during program execution.

A. Assumptions

The following assumptions are made in order to keep the modeling process concise.

1) There are four DRAM commands: PRE (precharge), ACT (activate), RD (read), and WR (write). Other commands such as REF (refresh) and RDAP/WRAP (auto-precharge after RD/WR) are excluded. The effect of REF commands on the access performance is negligible, and the auto-precharge commands are not generated when the memory controller adopts the open policy for row-buffer management.

2) All memory references represent cache misses, and each of them has a fixed length of one cache line, 64 Bytes. The data bus width is 64 bits (8 DRAM chips × 8 data pins per chip).

3) An RD or WR command accesses 8 columns of data, which makes the burst length 8. Thus, one RD or WR command, which transfers one cache line, corresponds to one read or write memory reference (a short arithmetic check follows this list).

4) The memory controller’s row-buffer-management policy is open. When a memory reference misses the row-buffer, the memory controller issues a chain of DRAM commands: PRE, ACT, RD for a read request or PRE, ACT, WR for a write one.

5) The memory controller conducts out-of-order memory access scheduling [13]: FR-FCFS (First-Ready First-Come First-Served). Thus, it tries to minimize the number of row-buffer misses by exploiting the locality remaining in the main memory access streams.
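The arithmetic behind assumptions 2) and 3) (a restatement of the figures above, not new data):

transfers per request = (cache line size) / (data bus width) = 64 Bytes / 64 bits = 64 Bytes / 8 Bytes = 8,

so a burst length of 8 moves exactly one 64-Byte cache line per RD or WR command.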

B. Minimum Inter-command Delays

As explained in Section II, the minimum timing delay of each DRAM command should be defined in a pair-wise fashion. There can be 16 DRAM command pairs because four DRAM commands are assumed. But seven of them, ACT-ACT, PRE-PRE, ACT-PRE, PRE-RD, PRE-WR, RD-ACT, and WR-ACT, never happen while the memory controller is working correctly. The possible DRAM command pairs are PRE-ACT, ACT-RD, ACT-WR, WR-WR, WR-RD, RD-WR, RD-RD, WR-PRE, and RD-PRE. The minimum inter-command delays for them are defined as follows.

The minimum inter-command delays for PRE-ACT and ACT-WR(RD) are defined as tRP and tRCD, respectively. As shown in Fig. 3, if a PRE command is issued to a bank, no other command can be issued to that bank during tRP. Similarly, if an ACT command is issued to a bank, no other command can be issued to that bank during tRCD.

When a column command, WR or RD, is issued, the minimum inter-command delays vary according to the command that follows. The WR-WR pair means that there are two consecutive write requests to the same bank, and the latter one hits the row-buffer. The minimum inter-command delay of the WR-WR pair should be tCCD to prevent data bus contention. The WR-RD pair occurs when there is a read request after a write one, and the read request hits the row-buffer. In this case, the RD command should be issued after tWTR (write-to-read delay) from the time instant when the data transfer for the WR command is completed. The WR command starts the data transfer after tCWL (CAS write latency) from the instant when the command is issued, and it takes tCCD to transfer the data. As a result, the minimum inter-command delay of the WR-RD pair is defined as tCWL + tCCD + tWTR. The WR-PRE pair means that there is a row-buffer miss after the write request. Note that the row-buffer miss request could be a read one or a write one. A PRE command can be issued only after tWR (write recovery) from the time instant when the data transfer for the WR command is completed. The minimum inter-command delay of WR-PRE is tCWL + tCCD + tWR.

Similar to the WR command, when an RD command is issued, the minimum inter-command delays vary according to the command that follows. The RD-RD pair means that there are two consecutive read requests to the same bank and the latter one hits the row-buffer. The minimum inter-command delay for the RD-RD pair is tCCD. The RD-WR pair occurs when there is a write request after a read one and the write request hits the row-buffer. In this case, the WR command should be issued after tRTW from the time instant when the RD command is issued. Note that there is no timing parameter defined as tRTW in the DRAM data sheet [12]; tRTW is defined in this paper as a counterpart of tWTR.


TABLE I
MINIMUM INTER-COMMAND DELAYS FOR DRAM COMMAND PAIRS AND PATTERN PARAMETERS

  DRAM command pair   min. inter-command delay    pattern parameters   main memory requests
  PRE-ACT             tRP                         Nm                   row-buffer misses
  ACT-WR, ACT-RD      tRCD
  WR-PRE              tCWL + tCCD + tWR           Nwp                  row-buffer miss after write
  RD(x)-PRE           tRAS − tRCD − (x−1)tCCD     Nrx                  row-buffer miss after x consecutive reads w/o write
  RD(others)-PRE      tRTP                        Nrt                  other cases for row-buffer miss
  WR-WR               tCCD                        Nww                  write/hit request after write
  RD-WR               tRTW                        Nrw                  write/hit request after read
  WR-RD               tCWL + tCCD + tWTR          Nwr                  read/hit request after write
  RD-RD               tCCD                        Nrr                  read/hit request after read

The minimum inter-command delay of the RD-PRE pair is complicated because there are two timing constraints, tRAS and tRTP, on the PRE command. As shown in Fig. 3-(c), the PRE command can be issued after tRTP from the time instant when the RD command is issued. At the same time, the time delay between ACT and PRE must satisfy the timing constraint tRAS (row access strobe), as shown in Fig. 3-(d). If there is only one RD command between ACT and PRE, the minimum inter-command delay for RD-PRE is tRAS − tRCD, since tRCD + tRTP is smaller than tRAS. If two RD commands are issued between ACT and PRE, the minimum inter-command delay for RD-PRE becomes tRAS − tRCD − tCCD, and so on. Let there be x RD commands between ACT and PRE; then the minimum inter-command delay of RD-PRE can be expressed as tRAS − tRCD − (x − 1)tCCD, which must be larger than or equal to tRTP. From this relation, the maximum number of RD commands (m) for which the minimum inter-command delay of RD-PRE remains tRAS − tRCD − (x − 1)tCCD can be represented as

m = ⌈(tRAS − tRCD − tRTP) / tCCD⌉.                                   (1)

If the number of RD commands (x) is larger than m, the minimum inter-command delay for RD-PRE is tRTP; otherwise it is tRAS − tRCD − (x − 1)tCCD. But if there is a WR command issued between ACT and PRE, the minimum inter-command delay of RD-PRE is always tRTP. The reason is that the minimum inter-command delay for WR-RD makes the ACT-PRE delay greater than tRAS for the DRAM models used in our experiments.

Table I summarizes the minimum inter-command delays for the DRAM command pairs as defined above. Note that 'RD(x)' means that there are x consecutive RD commands between the ACT and PRE commands. 'RD(others)' means that the number of RD commands between ACT and PRE is larger than m, or that there is one or more WR commands between ACT and PRE.
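Table I can be read as a lookup from a DRAM command pair to its minimum inter-command delay. The Python sketch below is our rendering of that table (the function and variable names are ours), instantiated with the DDR3-800 values from Table IV; the RD(x)-PRE case uses m from Eq. (1).

import math

# DDR3-800 timing parameters in nanoseconds (Table IV)
DDR3_800 = dict(tRP=12.5, tRCD=12.5, tCCD=10.0, tCWL=12.5, tRTW=15.0,
                tWTR=10.0, tRAS=37.5, tWR=15.0, tRTP=10.0)

def max_reads_before_precharge(t):
    """m of Eq. (1): largest x for which RD(x)-PRE is still bounded by tRAS."""
    return math.ceil((t["tRAS"] - t["tRCD"] - t["tRTP"]) / t["tCCD"])

def min_delay(pair, t, x=1):
    """Minimum inter-command delay of Table I.
    x is the number of RD commands between ACT and PRE (RD(x)-PRE case)."""
    if pair == "RD-PRE":
        if 1 <= x <= max_reads_before_precharge(t):
            return t["tRAS"] - t["tRCD"] - (x - 1) * t["tCCD"]
        return t["tRTP"]                                   # RD(others)-PRE
    table = {"PRE-ACT": t["tRP"],
             "ACT-RD": t["tRCD"], "ACT-WR": t["tRCD"],
             "WR-PRE": t["tCWL"] + t["tCCD"] + t["tWR"],
             "WR-WR": t["tCCD"], "RD-RD": t["tCCD"],
             "RD-WR": t["tRTW"],
             "WR-RD": t["tCWL"] + t["tCCD"] + t["tWTR"]}
    return table[pair]

print(max_reads_before_precharge(DDR3_800))  # 2, as noted for DDR3-800 in Section IV-B
print(min_delay("WR-RD", DDR3_800))          # 32.5 ns = tCWL + tCCD + tWTR
print(min_delay("RD-PRE", DDR3_800, x=1))    # 25.0 ns = tRAS - tRCD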

C. Pattern Parameters

The pattern parameters are defined as the numbers of occurrences of each DRAM command pair during program execution. They can be interpreted as the characteristics of memory access streams by considering the relationships between DRAM commands and memory requests.

Nm is defined as the number of ACT commands issued, and it is the same as the number of row-buffer misses. The row-buffer miss cases can be classified into Nwp, Nrx (x = 1, 2, · · ·, m), and Nrt according to the type of the previous requests. Nwp can be interpreted as the number of row-buffer misses that follow a write request. Nrx is the number of row-buffer misses that follow x consecutive read requests after the row has been opened. Nrt is the number of row-buffer misses that are not included in Nwp and Nrx.

The row-buffer hits are classified into four cases. Nww is the number of write requests that hit the row-buffer after a write one. Similarly, Nrw, Nwr, and Nrr can be interpreted as summarized in Table I.

Note that the pattern parameters are the characteristics of main memory access streams, which have been distributed to banks and ordered according to the memory controller's address mapping scheme and memory access scheduler, respectively.
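As a sketch of how the pattern parameters could be extracted, the Python function below (our illustration, not the instrumentation used in the simulator) counts the Table I parameters for one bank from an ordered list of (operation, hit/miss) outcomes, following the open-policy rules described above.

from collections import Counter

def count_pattern_parameters(requests, m):
    """Count Nm, Nwp, Nr1..Nrm, Nrt, Nww, Nrw, Nwr, Nrr for one bank.
    `requests` is a list of (op, outcome) with op in {'read', 'write'} and
    outcome in {'hit', 'miss'}, in the order the bank serviced them."""
    c = Counter()
    prev_op = None            # operation of the previous request to this bank
    reads_since_open = 0      # consecutive reads since the row was last opened
    wrote_since_open = False  # whether a write occurred since the row was opened
    for op, outcome in requests:
        if outcome == "miss":
            c["Nm"] += 1
            if prev_op == "write":
                c["Nwp"] += 1                              # miss right after a write
            elif not wrote_since_open and 1 <= reads_since_open <= m:
                c["Nr%d" % reads_since_open] += 1          # miss after x consecutive reads
            else:
                c["Nrt"] += 1                              # remaining miss cases
            reads_since_open, wrote_since_open = 0, False
        elif prev_op is not None:                          # row-buffer hit
            c[{("write", "write"): "Nww", ("read", "write"): "Nrw",
               ("write", "read"): "Nwr", ("read", "read"): "Nrr"}[(prev_op, op)]] += 1
        if op == "read":
            reads_since_open += 1
        else:
            wrote_since_open = True
        prev_op = op
    return c

# Example (synthetic stream): write miss, write hit, read hit, read miss
print(count_pattern_parameters([("write", "miss"), ("write", "hit"),
                                ("read", "hit"), ("read", "miss")], m=2))
# -> Nm=2, Nrt=2, Nww=1, Nwr=1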

D. Bank Busy Time

The length of the bank busy time is the sum of the minimum inter-command delays of all commands issued to a bank during program execution. When the number of occurrences of each DRAM command pair is given as in Table I, the length of the bank busy time is represented as follows.

Sbusy = tRP · Nm
      + tRCD · (Nm − Σ_{x=1}^{m} Nrx)
      + tCCD · (Nww + Nwr + Nrr + Nwp − Σ_{x=1}^{m} (x − 1) Nrx)
      + tCWL · (Nwr + Nwp)
      + tRTW · Nrw + tWTR · Nwr
      + tRAS · Σ_{x=1}^{m} Nrx
      + tWR · Nwp + tRTP · Nrt.                                      (2)

Note that the bank busy time is different from the DRAM access latency, and Eq. (2) does not include tCAS, the column access strobe, which is related with the read access latency. In contrast, tCWL, the column write latency, is included in Eq. (2) because this timing is related with the write-to-read turnaround and with precharging after a write request.
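Eq. (2) translates directly into code. The sketch below (our rendering, with dictionary keys following Table I and Table IV) evaluates the bank busy time of one bank from its pattern parameters; the numeric values in the example are hypothetical and only illustrate the call.

DDR3_800 = dict(tRP=12.5, tRCD=12.5, tCCD=10.0, tCWL=12.5, tRTW=15.0,
                tWTR=10.0, tRAS=37.5, tWR=15.0, tRTP=10.0)  # Table IV, in nsec

def bank_busy_time(t, p, m):
    """Eq. (2): bank busy time from DRAM timings `t` and pattern parameters `p`."""
    nrx = [p.get("Nr%d" % x, 0) for x in range(1, m + 1)]
    sum_nrx = sum(nrx)
    sum_x_minus_1_nrx = sum((x - 1) * n for x, n in enumerate(nrx, start=1))
    return (t["tRP"] * p["Nm"]
            + t["tRCD"] * (p["Nm"] - sum_nrx)
            + t["tCCD"] * (p["Nww"] + p["Nwr"] + p["Nrr"] + p["Nwp"] - sum_x_minus_1_nrx)
            + t["tCWL"] * (p["Nwr"] + p["Nwp"])
            + t["tRTW"] * p["Nrw"] + t["tWTR"] * p["Nwr"]
            + t["tRAS"] * sum_nrx
            + t["tWR"] * p["Nwp"] + t["tRTP"] * p["Nrt"])

# Hypothetical pattern parameters for one bank (illustration only).
p = dict(Nm=800, Nwp=200, Nr1=300, Nr2=100, Nrt=200, Nww=50, Nrw=30, Nwr=40, Nrr=80)
print(bank_busy_time(DDR3_800, p, m=2))  # bank busy time in nanoseconds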

To inspect the impact of pattern parameters and DRAM timings on the bank busy time, the average bank busy time is defined as the average time per memory request. Let N be the number of memory requests to a bank. Then, the average bank busy time is

Tbusy = Sbusy / N
      = tRP · w0 + tRCD · w1 + tCCD · w2
      + tCWL · w3 + tRTW · w4 + tWTR · w5
      + tRAS · w6 + tWR · w7 + tRTP · w8,                            (3)

where w0, w1, . . . , w8 are defined as

w0 = Nm / N
w1 = (Nm − Σ_{x=1}^{m} Nrx) / N
w2 = (Nww + Nwr + Nrr + Nwp − Σ_{x=1}^{m} (x − 1) Nrx) / N
w3 = (Nwr + Nwp) / N
w4 = Nrw / N
w5 = Nwr / N
w6 = (Σ_{x=1}^{m} Nrx) / N
w7 = Nwp / N
w8 = Nrt / N.                                                        (4)

To summarize, the average bank busy time is represented as a weighted sum of DRAM timing parameters, where the weights, w0, w1, . . . , w8, are determined by the pattern parameters. The weighted value of each DRAM timing, or each term of Eq. (3), represents the impact of that timing on the bank busy time.
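The weights of Eq. (4) and the average bank busy time of Eq. (3) can be computed in the same style (again our rendering; N is taken as the total of the four hit counts plus Nm):

def busy_time_weights(p, m):
    """Eq. (4): weights w0..w8 from the pattern parameters of one bank."""
    n = p["Nm"] + p["Nww"] + p["Nrw"] + p["Nwr"] + p["Nrr"]   # total requests N
    nrx = [p.get("Nr%d" % x, 0) for x in range(1, m + 1)]
    sum_nrx = sum(nrx)
    sum_x_minus_1_nrx = sum((x - 1) * v for x, v in enumerate(nrx, start=1))
    return [p["Nm"] / n,                                                           # w0
            (p["Nm"] - sum_nrx) / n,                                               # w1
            (p["Nww"] + p["Nwr"] + p["Nrr"] + p["Nwp"] - sum_x_minus_1_nrx) / n,   # w2
            (p["Nwr"] + p["Nwp"]) / n,                                             # w3
            p["Nrw"] / n,                                                          # w4
            p["Nwr"] / n,                                                          # w5
            sum_nrx / n,                                                           # w6
            p["Nwp"] / n,                                                          # w7
            p["Nrt"] / n]                                                          # w8

def average_bank_busy_time(t, p, m):
    """Eq. (3): weighted sum of the DRAM timing parameters."""
    order = ["tRP", "tRCD", "tCCD", "tCWL", "tRTW", "tWTR", "tRAS", "tWR", "tRTP"]
    return sum(w * t[k] for w, k in zip(busy_time_weights(p, m), order))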

E. Bank Utilization

We define the bank utilization to explain the system-level performance when the execution time is limited by memory access performance, which we call memory-bounded. This assumption holds when there are a sufficient number of memory requests and they are independent of each other, which happens when a large number of processors execute multi-threaded programs. Let Sf be the amount of time delay due to inter-bank interference. The bank utilization is defined as Ubank = (Sbusy + Sf) / T, where T is the execution time. The value of Sf is measured during simulation, since it is not included in the proposed model. We will compare the bank utilization and the data bus utilization for memory-bounded cases in Section IV.

IV. EXPERIMENTAL RESULTS

A. Experimental Setup

The proposed model is evaluated by using execution-driven simulation with M5 [14], an architecture simulator, and its main memory module extension for detailed DRAM timing simulation. The simulator is configured as summarized in Table II. The DRAM timing parameters are obtained from the JEDEC DDR3 SDRAM Standard [12] as shown in Table IV. During program execution, the DRAM timing simulator collects the pattern parameters by counting the number of occurrences of each DRAM command pair, as explained in Section III.

TABLE II
SIMULATED ARCHITECTURE CONFIGURATION

  processor     simple, in-order, Alpha ISA
                clock frequency          2 GHz
                # of processors (P)      1, 2, 4, 8, 16, and 32
  caches        L1                       split (L1I, L1D), private, 64 KB, 2-way
                L2                       unified, shared, 512 KB, 8-way
  main memory   address space            2 GBytes
                memory access scheduler  FR-FCFS
                row-buffer management    open policy
                address mapping          row:bank:col
                channel                  1
                bus width                64 bits
  DRAM          row-buffer size          1 KBytes × 8 chips
                number of banks          8
                timing parameters        DDR3-800, -1066, -1333, -1600

TABLE III
WORKLOADS

  kernel/application   description
  FFT.MT               512 × 512 matrix transpose phase
  FFT.MM               512 × 512 matrix multiplication phase
  Radix                number of keys to sort 262144, radix = 2048
  OceanContig          grid size: 258 × 258
  Cholesky             input: tk23.O
  LUContig             512 × 512
  Raytrace             input: teapot.env
  FMM                  2048 particles
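As a hedged illustration of the row:bank:col address mapping in Table II, the sketch below splits a physical address into row, bank, and column fields. The exact bit boundaries are our assumption (64-Byte lines, an 8 KB row per rank from the 1 KB-per-chip row-buffer × 8 chips, and 8 banks); the real controller may place the fields differently.

def decode_row_bank_col(addr, line_bytes=64, row_bytes=8 * 1024, num_banks=8):
    """Split a physical address as row:bank:col (most to least significant)."""
    line = addr // line_bytes                 # cache-line index within the address space
    cols_per_row = row_bytes // line_bytes    # 128 cache lines per DRAM row (assumed)
    col = line % cols_per_row
    bank = (line // cols_per_row) % num_banks
    row = line // (cols_per_row * num_banks)
    return row, bank, col

# Consecutive cache lines walk through one row of one bank before switching banks.
print(decode_row_bank_col(0x0000))   # (0, 0, 0)
print(decode_row_bank_col(0x2000))   # (0, 1, 0): the next 8 KB chunk maps to the next bank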

Seven multi-threaded workloads are chosen from the SPLASH-2 benchmark suite, as summarized in Table III. For FFT, two program phases are selected: matrix transpose and matrix multiplication. Refer to [15][16] for the description and implementation details of the workloads. The initialization phase of each workload is excluded from the evaluation because it shows almost the same memory access behavior even if the number of processors is changed.

B. Characteristics of DRAM Accesses

The pattern parameters for the workloads are obtained from the simulation with the DDR3-800 timing parameters, as shown in Fig. 4. The results for other DRAM models also show similar characteristics. The memory requests to each bank are classified into four hit and four miss cases as the proposed model suggests. Note that m is 2 for the DDR3-800 model.

The row-buffer miss ratio is adopted in many DRAM performance studies, and it can be derived from the pattern parameters as Nm / N, or w0 in Eq. (4). The row-buffer miss ratio, in general, grows as the number of processors increases. Even if the locality of memory accesses from a single processor is high, it vanishes when the memory accesses are intermingled with those from other processors. For example, as shown in Fig. 4-(b), the row-buffer miss ratio of FFT.MM is about 0.5 with a single processor, but it reaches 0.9 when the number of processors is 32. If the row-buffer miss ratio is close to 1.0, the pattern parameters are simplified to Nwp and Nr1, which are the numbers of write and read requests, respectively, since the other pattern parameters are close to zero.


TABLE IV
DRAM TIMING PARAMETERS (UNIT: NSEC)

          DDR3-800  DDR3-1066  DDR3-1333  DDR3-1600  description
          5-5-5     6-6-6      7-7-7      8-8-8
  tCK     2.5       1.875      1.5        1.25       clock cycle time
  tRP     12.5      11.25      10.5       10.0       time delay for precharging
  tRCD    12.5      11.25      10.5       10.0       time delay for activating
  tCCD    10.0      7.5        6.0        5.0        column-to-column delay
  tCAS    12.5      11.25      10.5       10.0       read latency (column access strobe)
  tCWL    12.5      11.25      10.5       10.0       write latency
  tRTW    15.0      11.25      9.0        7.5        read-to-write delay
  tWTR    10.0      7.5        7.5        7.5        write-to-read delay
  tRAS    37.5      37.5       36.0       35.0       activate-to-precharge delay (row access strobe)
  tWR     15.0      15.0       15.0       15.0       write recovery time
  tRTP    10.0      7.5        7.5        7.5        read-to-precharge delay
  tRRD    10.0      7.5        6.0        6.0        activate-to-activate delay
  tFAW    40.0      37.5       30.0       30.0       four activation window

[Fig. 4 omitted: per-bank pattern parameters for (a) FFT.MT, (b) FFT.MM, (c) Radix, (d) OceanContig, (e) Cholesky, (f) LUContig, (g) Raytrace, and (h) FMM, for processor counts 1, 2, 4, 8, 16, and 32. Each bar (banks 0 to 7) decomposes the number of memory requests (×10^3) into Nww (write-write/hit), Nrw (read-write/hit), Nwr (write-read/hit), Nrr (read-read/hit), Nr2 (miss after 2 reads), Nr1 (miss after 1 read), Nrt (miss after read, other cases), and Nwp (miss after write).]

Fig. 4. Pattern parameters (bank 0 to 7 from the left-most column for each processor count). The y-axis is the number of memory requests for each bank.

An important issue with the pattern parameters is how to determine representative values for a workload. When the memory accesses are distributed uniformly to all banks, as in FFT.MT, FFT.MM, and OceanContig, it is safe to use average values across all banks as the pattern parameters for the workload. However, when the memory accesses are concentrated on one or two banks, as in Raytrace and FMM, the average values cannot be representative of the workload. The problem is more serious when the performance is memory-bounded, where the memory requests to one bank determine the execution time. However, the execution time is a very complex function of various factors, and it is not enough to explain it with only the pattern parameters. An approximation for the non-uniform-distribution, memory-bounded case is to select the bank that has the maximum number of requests and use the pattern parameters of that bank for the workload.


[Fig. 5 omitted: box plots of (a) the coefficients w0–w8 of the average bank busy time (values 0.0–1.0) and (b) the DDR3-800 timings tRP, tRCD, tCCD, tCWL, tRTW, tWTR, tRAS, tWR, and tRTP weighted by those coefficients (time in nsec).]

Fig. 5. The coefficients of the average bank busy time and the DDR3-800 timings weighted by the coefficients across all banks and workloads when the number of processor cores is 32 (the box means the 25th ∼ 75th percentile of the data set)

C. Impact of DRAM Timings on Bank Busy Time

The ranges of the weights, wi (i = 0, 1, · · ·, 8), of Eq. (4) and of the weighted DDR3-800 timings for all workloads are depicted in Fig. 5, where the pattern parameters are given as shown in Fig. 4 for the processor count of 32.

According to Eq. (4), the row-buffer miss ratio, w0, is the weight of tRP in expressing the bank busy time, and w0 is in the range of 0.56 ∼ 0.99. The average of w0 is about 0.82, which is the largest among the employed weights. As a result, the impact of tRP on the average bank busy time is the second largest among all DRAM timings. It is in the range of 17% ∼ 24% and the average is 21%. It implies that the average bank busy time can be effectively reduced by decreasing tRP.

The weight of tRAS, w6, shows large variations across workloads (0.11 ∼ 0.72). The average value of w6 is about 0.37, which is small compared to other weights such as w0, w1, w2, w3, and w7, as shown in Fig. 5-(a). However, tRAS is the largest among the DRAM timings for all DRAM models, and its impact on the average bank busy time is greater than the others, as shown in Fig. 5-(b). Its impact on the average bank busy time is about 24% ∼ 36%.

The weight of tCCD, or w2, is about 0.58 on average, and tCCD is 10 nsec for the DDR3-800 model. The weighted timing of tCCD is about 5.8 nsec on average, and it accounts for about 12% of the average bank busy time on average across all workloads (its range is 6% ∼ 17%). Note that, among the DRAM timings, tCCD is the only one that is scaled down in direct proportion to the DRAM clock frequency increase, as shown in Table IV. This means that doubling the DRAM clock frequency contributes only about a 12% reduction of the average bank busy time.

The impact of tRTW, tWTR, and tRTP on the average bank busy time is negligible for all workloads because their weights, w4, w5, and w8, are lower than 0.1. This means that the read-to-write and write-to-read turnaround overheads, expressed as tRTW and tWTR, respectively, do not affect the average bank busy time much. The reason is that the row-buffer miss ratio is high in most of the workloads, and the row-buffer hit cases, if any, fall into the read-to-read or write-to-write cases, but not into the read-to-write or write-to-read cases.

The impacts of tRCD, tCWL, and tWR on the average bank busy time are similar because both the timings and their weights, w1, w3, and w7, are not much different. Their impact on the average bank busy time is in the range of 10% ∼ 14%.

D. Bank Utilization for Memory-bounded Cases

Figure 6 compares the bank and the data bus utilization when the execution time is limited by DRAM accesses. For that purpose, the configuration of the simulator is slightly modified so that there is no shared L2 cache and the number of processors is 64. The data bus utilization can be calculated easily when the number of memory requests for all the banks is given. The bank utilization is calculated as defined in Section III, where Sbusy is calculated by using Eq. (2) with the obtained pattern parameters and the DRAM timings, and Sf is measured by the DRAM timing simulator. The bank utilization for a workload is determined by one representative bank, the one with the maximum number of requests.
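For reference, the data-bus side of Fig. 6 can be sketched as follows (this derivation is ours, not spelled out in the paper): with a burst length of 8 on a double-data-rate bus, each request occupies the data bus for 4 clock cycles, which equals tCCD for every model in Table IV, so the data bus busy time is simply the total request count times that per-request occupancy.

def data_bus_busy_time(total_requests, tCK):
    """Data bus occupancy for BL8 on a DDR bus: 8 beats = 4 clock cycles per request."""
    return total_requests * 4 * tCK

def data_bus_utilization(total_requests, tCK, execution_time):
    """Fraction of the execution time during which the data bus transfers data."""
    return data_bus_busy_time(total_requests, tCK) / execution_time

# Example: one request on DDR3-800 (tCK = 2.5 nsec) occupies the bus for 10 nsec (= tCCD).
print(data_bus_busy_time(1, 2.5))  # 10.0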

The results show that the bank utilization becomes the performance limiting factor as the DRAM clock frequency increases. When the DRAM clock frequency is doubled from DDR3-800 to DDR3-1600, the data bus busy time, Dbusy, is reduced by 50%; the data bus busy time is inversely proportional to the DRAM clock frequency. The amount of time delay due to inter-bank interference, Sf, is sensitive to the DRAM clock frequency, and it is reduced by 66% on average over all workloads. However, the bank busy time Sbusy is decreased by only 20% because most of the DRAM timings, except tCCD, are not scaled down by the same amount as the DRAM clock frequency increase. As a result, the execution time is reduced by 32% on average, and the bank busy time becomes the dominant factor.

When the memory requests are distributed uniformly to all banks, the amount of time delay due to inter-bank interference is reduced as the DRAM clock frequency increases. For example, Sf of FFT.MT is 39% of the execution time for DDR3-800, and it is reduced to 19% for DDR3-1600. In contrast, when the memory requests are distributed non-uniformly, namely when the bank-level parallelism is small, the bank busy time is the dominant factor of the execution time. For example, Sbusy of Raytrace is 92%, 89%, 92%, and 89% of the execution time for the four DRAM models, respectively.


[Fig. 6 omitted: for each workload (FFT.MT, FFT.MM, Radix, OceanContig, Cholesky, LUContig, Raytrace, FMM) and each DRAM model (DDR3-800, -1066, -1333, -1600), stacked bars of normalized execution time decomposed into free-time, inter-bank interference (Sf), and bank busy time (Sbusy) for the bank utilization, and into free-time and data bus busy time (Dbusy) for the data bus utilization.]

Fig. 6. Comparisons of DRAM bank and data bus utilization for memory-bounded cases, where the number of processors is 64 without shared L2 caches. The x-axis represents DRAM models and workloads. The y-axis represents the execution time, which is normalized to the DDR3-800 case of each workload, and it is decomposed into (free-time) + (time delays due to inter-bank interference) + (bank busy time) for the bank utilization or (free-time) + (data bus busy time) for the data bus one.

Sf is at most 6% of the execution time, and increasing the DRAM clock frequency does not help in reducing the execution time. For the non-uniform distribution cases, enhancing the address mapping scheme of the memory controller [17][18][19] is the first choice for optimization. However, even if the available bank-level parallelism is exploited as much as possible with the help of the enhanced scheme, the performance increase is marginal because the amount of time delay due to inter-bank interference increases, as shown in the uniform distribution cases. Thus, enhancing the address mapping scheme should be accompanied by a DRAM clock frequency increase to maximize its effect on the execution time for memory-bounded cases.

However, as the DRAM clock frequency increases, the bank busy time becomes dominant in the execution time and, in some cases such as FFT.MT and Radix for DDR3-1600, it is larger than the data bus busy time. Supposing that the trend of DRAM timings continues, only tCCD is scaled down in direct proportion to the DRAM clock frequency and the other timings are shortened slightly or not changed at all. Then, the bank busy time would exceed the data bus busy time and the execution time would be limited by the bank busy time. As a result, it is crucial to lower the DRAM timings as the DRAM clock frequency increases.

V. DISCUSSION

The effect of inter-bank interference is not considered in the proposed model. The memory performance model based on the bank busy time is valid only when the effect of inter-bank interference is negligible. There are two cases in Fig. 6 where the effect of inter-bank interference is small compared to the bank busy time. The first one is when the DRAM clock frequency is high. For example, the time delay due to inter-bank interference of FFT.MM for DDR3-800 is 53% of the execution time, but it is reduced to 25% for DDR3-1600. The other case is when the memory accesses are concentrated on one or two banks, as in Raytrace and FMM as shown in Fig. 4, where the bank busy time is at least 89% of the execution time. Thus, to completely address the characteristics of main memory access streams and to predict the DRAM access performance precisely, modeling the amount of time delay due to inter-bank interference is vital for accuracy.

According to the simulation results, the effect of inter-bank interference is related with the DRAM clock frequency and the distribution of memory requests to banks. The chance of bus contention is reduced when the DRAM clock frequency increases or the distribution is non-uniform. In addition, there are inter-bank timing constraints such as tRRD and tFAW, which should be considered in deciding when to issue activate commands. We plan to analyze the relationships between these factors quantitatively and to predict the amount of time delay due to inter-bank interference in our future work.

VI. CONCLUDING REMARKS

This paper proposes an analytical DRAM performance model to predict the length of the bank busy time. The bank busy time is defined as the length of time spent by a bank to service the memory requests to the bank, and it is affected by the successive command pairs issued to the DRAM due to the effect of row-buffers. The bank busy time is a more reliable performance metric than the bus busy time when the number of processors increases and the execution becomes memory-bounded. We represent the bank busy time as a linear combination of DRAM timings and the pattern parameters that capture the characteristics of main memory access streams. The proposed model provides simple and clear insights into how the DRAM access performance is affected by DRAM timings and main memory access behavior. According to the simulation results for memory-bounded cases, the bank busy time becomes dominant in the execution time as the DRAM clock frequency increases. Since the clock frequency increase can only reduce the bus busy time, this study shows the limitation of the current trend of DRAM timing optimization. The proposed model is expected to be a useful tool for providing DRAM timing guidelines in the early design stage of the next DRAM standards. We plan to extend the model to address the effect of inter-bank interference as future work.


ACKNOWLEDGMENT

We appreciate the constructive comments from the anonymous reviewers. We wish to thank Prof. Zhao Zhang for helpful suggestions. This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. 2010-0027735).

REFERENCES

[1] W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,” SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, March 1995.
[2] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th ed. Morgan Kaufmann Publishers Inc., 2006.
[3] B. Jacob, S. Ng, and D. Wang, Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann Publishers Inc., 2007.
[4] XDR datasheet, Rambus, Inc., 2002.
[5] B. Ganesh, A. Jaleel, D. Wang, and B. Jacob, “Fully-buffered DIMM memory architectures: Understanding mechanisms, overheads and scaling,” in Proc. Int. Symp. High-Performance Computer Architecture, 2007.
[6] J. Ahn, M. Erez, and W. Dally, “The design space of data-parallel memory systems,” in Proc. ACM/IEEE Conf. Supercomputing, 2006.
[7] G. Yuan and T. Aamodt, “A hybrid analytical DRAM performance model,” in Proc. Workshop on Modeling, Benchmarking and Simulation, 2009.
[8] I. Y. Bucher and D. A. Calahan, “Models of access delays in multiprocessor memories,” IEEE Trans. Parallel Distrib. Syst., vol. 3, no. 3, pp. 270–280, May 1992.
[9] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, “Exploiting memory access patterns to improve memory performance in data parallel architectures,” IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 1, pp. 105–118, January 2010.
[10] R. C. Murphy and P. M. Kogge, “On the memory access patterns of supercomputer applications: Benchmark selection and its implications,” IEEE Trans. Comput., vol. 56, no. 7, pp. 937–945, July 2007.
[11] E. Strohmaier and S. Hongzhang, “Architecture independent performance characterization and benchmarking for scientific applications,” in Proc. Int. Symp. Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, 2004.
[12] DDR3 SDRAM Standard, 2008. [Online]. Available: http://www.jedec.org/standards-documents/docs/jesd-79-3d
[13] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory access scheduling,” SIGARCH Comput. Archit. News, vol. 28, no. 2, pp. 128–138, May 2000.
[14] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt, “The M5 simulator: Modeling networked systems,” IEEE Micro, vol. 26, no. 4, pp. 52–60, July 2006.
[15] S. C. Woo, J. P. Singh, and J. L. Hennessy, “The performance advantages of integrating block data transfer in cache-coherent multiprocessors,” SIGPLAN Not., vol. 29, no. 11, pp. 219–229, November 1994.
[16] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-2 programs: Characterization and methodological considerations,” in Proc. Int. Symp. Computer Architecture, 1995.
[17] Z. Zhang, Z. Zhu, and X. Zhang, “A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality,” in Proc. ACM/IEEE Int. Symp. Microarchitecture, 2000.
[18] J. Shao and B. T. Davis, “The bit-reversal SDRAM address mapping,” in Proc. Int. Workshop Software and Compilers for Embedded Systems, 2005.
[19] R. Raghavan and J. P. Hayes, “On randomly interleaved memories,” in Proc. ACM/IEEE Conf. Supercomputing, 1990.
