



Effective Management of DRAM Bandwidth in Multicore Processors

Nauman Rafique, Won-Taek Lim, Mithuna Thottethodi
School of Electrical and Computer Engineering, Purdue University
West Lafayette, IN 47907-2035
{nrafique,wlim,mithuna}@purdue.edu

Abstract

Technology trends are leading to an increasing number of cores on chip. All these cores inherently share the DRAM bandwidth. The on-chip cache resources are limited and in many situations cannot hold the working set of the threads running on all these cores. This situation makes DRAM bandwidth a critical shared resource. Existing DRAM bandwidth management schemes provide support for enforcing bandwidth shares but have problems like starvation, complexity, and unpredictable DRAM access latency.

In this paper, we propose a DRAM bandwidth management scheme with two key features. First, the scheme avoids unexpectedly long latencies and starvation of memory requests. It also allows the OS to select the right combination of performance and strength of bandwidth share enforcement. Second, it provides a feedback-driven policy that adaptively tunes the bandwidth shares to achieve desired average latencies for memory accesses. This feature is useful under high contention and can be used to provide performance-level support for critical applications or to support service level agreements for enterprise computing data centers.

1. Introduction

The two technology trends of increasing transistors per die and limited power budgets have driven every major processor vendor to multicore architectures. The number of on-chip cores is expected to increase over the next few years [1]. This increasing number of cores (and consequently threads) share the off-chip DRAM bandwidth, which is bound by technology constraints [2, 4]. While on-chip caches help to some extent, they cannot possibly hold the working set of all the threads running on these on-chip cores. Thus DRAM bandwidth is a critical shared resource which, if unfairly allocated, can cause poor performance or extended periods of starvation. Because it is unacceptable that applications suffer from unpredictable performance due to the concurrent execution of other applications, DRAM bandwidth allocation primitives that offer isolated and predictable DRAM throughput and latency are desirable design goals. This paper develops simple, efficient and fair mechanisms that achieve these two goals.

The challenges in achieving these goals are twofold. First, the time to service a DRAM request depends on the state of the DRAM device, which in turn depends on previous DRAM operations. This makes it hard to leverage well-understood fair queuing mechanisms since memory latency is not only variable, it also depends on the previously scheduled DRAM accesses. Second, even in the presence of fair queuing mechanisms that preserve relative bandwidth shares of contending sharers, there can still be significant variation in access latency because of (a) variation in application demand and (b) variation in DRAM state due to interleaved requests from other sharers. Allocating bandwidth shares to achieve the necessary latency for the worst case is wasteful, as also observed by others [27].

Our paper develops two key innovations to address the above two challenges of managing shared DRAM bandwidth in multicore systems. First, we develop a fair bandwidth sharing mechanism that abstracts away the internal details of DRAM operation such as precharge, row activation and column activation and treats every DRAM access as a constant, indivisible unit of work. Our view of memory accesses as indivisible unit-latency operations is purely a logical construct to facilitate fair bandwidth sharing. In practice, our scheme does employ memory access scheduling optimizations that schedule individual DRAM commands to exploit page hits. One obvious benefit of our technique is that it enables fair sharing with lower complexity compared to a previous technique that incorporates detailed DRAM access timing [27]. One may think that this reduction in complexity comes at the cost of scheduling efficiency because the abstraction hides the true nature of DRAM access, which is inherently variable and dependent on the state of the DRAM device. On the contrary, our technique yields improvements in the DRAM latencies observed by sharers and consequently in overall performance. This is primarily because the simple abstraction of memory requests as an indivisible constant unit of work enables us to adopt existing fair queuing schemes with known theoretical properties. In particular, we adopt start-time fair queuing [8], which offers provably tight bounds on the latency observed by any sharer.

The above bandwidth sharing scheme can efficiently enforce specified bandwidth shares. However, the total memory latency observed by memory requests is highly variable because it depends not only on the bandwidth share but also on contention for the device. Under high contention, the latencies observed by requests in the memory system can be hundreds of cycles more than the latencies observed without contention, as shown by our results in Section 5.2. Our second innovation is a feedback-based adaptive bandwidth sharing policy in which we periodically tune the bandwidth assigned to the sharers in order to achieve specified DRAM latencies. Adaptive bandwidth management does not add to the complexity of the hardware because it can be done entirely in software at the operating system (OS) or the hypervisor by using interfaces and mechanisms similar to those proposed for shared cache management [29]. While that interface supports various features such as thread migration and thread grouping (wherein a group of threads acts as a single resource principal), our study assumes that each thread running on a processor is a unique resource principal with its own allocated bandwidth share. As such, we use the terms threads, sharers and contenders interchangeably in the rest of this paper.

Related Work: Recently, researchers have recognized the fairness issues involved in the management of DRAM bandwidth [25, 27]. Nesbit et al. [27] propose a fair queuing scheme for providing fairness, but their technique can suffer from starvation, as we discuss in Sections 3.1 and 6. We compare our technique to theirs, referred to as FQ-VTVF, in our simulation results reported in Section 5.1. Moscibroda et al. [25] recognized the starvation problems in the work of Nesbit et al. [27] and proposed a scheme which keeps track of the latencies threads would experience without contention. Our scheme avoids such hardware overhead and still prevents starvation problems.

In summary, this paper makes the following contributions:

• We demonstrate a simple and efficient mechanism to achieve fair queuing for DRAM access that improves average sharer service latency (the number of cycles a sharer waits before receiving service) by 17.1% and improves overall performance by 21.7%, compared to a previously proposed DRAM fair-queuing scheme [27]. In addition, our technique reduces worst case service latencies by up to a factor of 15.

• We demonstrate that a feedback-based adaptive policy can effectively deliver predictable memory latencies (within 12% of the target in our experiments) for a preferred sharer. Such a policy can be used to prioritize critical threads or support service level agreements in data centers.

The rest of this paper is organized as follows. Section 2 offers a brief background of fair queuing and DRAM memory system organization. Then we present the details of our technique in Section 3. We discuss our experimental methodology in Section 4 and present our results in Section 5. We describe the related work in Section 6 and conclude our paper with future possibilities in Section 7.

2. Background

This study applies the ideas from fair queuing to DRAM systems.

Figure 1. An example for fair queuing (adapted from [13]): sharers A, B, C and D have weights 5, 20, 10 and 1 (service times 20, 5, 10 and 100); the resulting schedule serves C at virtual time 27, A at 35, D at 45, A at 55, 75, 95, 115 and 135, and D at 145.

In order to give a clear understanding of the concepts in the paper, we introduce fair queuing in this section, followed by an explanation of the working of DRAM systems.

2.1. Fair Queuing

The notion of fairness is well-defined and well-studied in the context of networking. Fair queuing systems are based on the concept of max-min fairness [5, 6]. A resource allocation is max-min fair if it allocates resources to sharers in proportion to their relative weights (or shares, commonly represented as φ). A sharer is referred to as backlogged if it has pending requests to be served. A service discipline which follows the above definition of fairness divides the available bandwidth among the backlogged queues in proportion to their shares. If a sharer is not backlogged, its share of bandwidth is distributed among the backlogged sharers. As soon as it becomes backlogged again, it starts receiving its share. A sharer is not given credit for the time it is not backlogged. Such credit accumulation can cause unbounded latency for the other sharers (even though these other sharers did not use the bandwidth at the expense of the first sharer). Scheduling algorithms which penalize sharers for using idle bandwidth are generally regarded as unfair [8, 28]. Fair allocation algorithms guarantee a fair share regardless of prior usage.

Formally, an ideal fair queuing server (commonly referred to as a Generalized Processor Sharing or GPS server) can be defined as follows. Suppose there are N flows and they are assigned weights $\phi_1, \phi_2, \ldots, \phi_N$. Let $W_a(t_1, t_2)$ be the amount of flow a's traffic served during an interval $(t_1, t_2)$ during which a is continuously backlogged. Then for a GPS server,

$$\frac{W_a(t_1, t_2)}{W_b(t_1, t_2)} = \frac{\phi_a}{\phi_b}, \quad b = 1, 2, \ldots, N, \qquad \text{or equivalently} \qquad \frac{W_a(t_1, t_2)}{\phi_a} - \frac{W_b(t_1, t_2)}{\phi_b} = 0$$

Most fair queuing implementations use a virtual clock to implement GPS (or its approximation). We explain the use of a virtual clock with an example adapted from [13]. Figure 1 shows four sharers, A, B, C and D, with weights of 5, 20, 10 and 1 respectively. At the instant shown, only A, C and D are currently backlogged. These sharers have to be served in such a way that the service received by each of them is proportional to the weights assigned to them.



Figure 2. DRAM Device Internal Organization (banks 0 through N, each with a memory array, row decoder, column decoder, and a row buffer of sense amplifiers, addressed by row and column addresses).

It is not desirable to visit sharers in a simple round robin manner, serving them in proportion to their weights, as that would result in requests from one sharer being clustered together while others experience long latencies [13]. We assume here that each request takes one unit of time for its service. For each sharer, we compute a service interval which is the reciprocal of its weight. Note here that weights do not have to add up to any specific value, so their value does not change when sharers become backlogged or unbacklogged. At any instant, we select the sharer which has the minimum (earliest) scheduled service time. For the instant shown in the example, sharer C has the minimum service time, i.e., 27. So the virtual time is advanced to 27 and sharer C is served. Since sharer C does not have any requests left, we do not schedule it. Next, virtual time is advanced to 35 and sharer A is served. Since A has more requests to be served, we have to reschedule it. A's service interval is 20, so its next request should be served at time 20+35=55. It is apparent that the virtual time is advanced to serve the next sharer and thus runs faster than the real time. We continue in a similar manner and the resulting service schedule is shown in Figure 1. When no sharer is backlogged, virtual time is set to zero. Once a sharer becomes backlogged, its new service time has to be computed. Fair queuing schemes differ in how they assign service time to a newly backlogged sharer.
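To make the bookkeeping concrete, the following Python sketch replays the virtual-clock schedule of Figure 1. It assumes the service interval is scaled as 100/weight (consistent with the service times 20, 5, 10 and 100 listed in the figure) and that A, C and D have 6, 1 and 2 pending requests respectively; the pending counts are illustrative assumptions, not values given in the paper.

```python
import heapq

def virtual_clock_schedule(sharers, max_events):
    """Serve backlogged sharers in order of their scheduled virtual times.

    sharers: name -> (weight, next_service_time, pending_requests).
    The sharer with the earliest scheduled virtual time is served next and,
    if it still has pending requests, is rescheduled one service interval
    (taken here as 100/weight) later.
    """
    heap = [(start, name) for name, (_, start, pend) in sharers.items() if pend > 0]
    heapq.heapify(heap)
    pending = {name: pend for name, (_, _, pend) in sharers.items()}
    schedule = []
    for _ in range(max_events):
        if not heap:
            break
        vtime, name = heapq.heappop(heap)      # advance virtual time to this tag
        schedule.append((vtime, name))
        pending[name] -= 1
        if pending[name] > 0:                  # reschedule a still-backlogged sharer
            interval = 100 / sharers[name][0]
            heapq.heappush(heap, (vtime + interval, name))
    return schedule

# Backlogged sharers A, C and D from Figure 1 (B is idle).
sharers = {"A": (5, 35, 6), "C": (10, 27, 1), "D": (1, 45, 2)}
print(virtual_clock_schedule(sharers, 9))
# C served at 27, A at 35, D at 45, then A at 55, 75, 95, 115, 135 and D at 145.
```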

In order to implement the above mentioned algorithm for the general case of variable sized requests, each request is assigned a start and a finish tag. Let $S^a_i$ be the start tag of the ith packet of sharer a and $F^a_i$ be its finish tag; these tags are assigned as follows:

$$S^a_i = \max(v(A^a_i), F^a_{i-1}) \qquad \text{and} \qquad F^a_i = S^a_i + l^a_i / \phi_a$$

where $v(A^a_i)$ is the virtual arrival time of the ith packet of sharer a, $l^a_i$ is the length of the packet, and $F^a_0 = 0$.

For an ideal (GPS) server, the virtual time is computed using the server capacity and the number of backlogged queues [5, 28]. Since the number of backlogged queues changes frequently (especially in a DRAM system), such computation is prohibitively expensive [7, 8]. To solve this problem, Zhang et al. [37] proposed a virtual clock algorithm which replaces $v(A^a_i)$ with real time in the calculation of the start time. A similar approach was adopted by Nesbit et al. [27].

Figure 3. Architecture of Memory Controller (per-sharer start and finish tag registers, for sharers 0 through m−1, added alongside the transaction buffer; each transaction buffer entry holds type, row, column, state, data, sender and status fields; requests arrive from the L2 banks over the interconnect and the channel scheduler issues DRAM commands).

This approach does not provide fairness [7, 8, 25]. We explain that with an example. Consider a bursty sharer which does not send requests for a long interval. During that interval, the virtual clocks of the other backlogged sharers would make progress faster than the real clock (as the virtual clock is time scaled by the fraction of the bandwidth assigned to each sharer). When the bursty sharer finally starts sending requests, it would be assigned a virtual time much smaller than the other sharers and can halt their progress for an unbounded time.

Several other fair queuing schemes have been proposed with various advantages and disadvantages. We refer the reader to [8, 36] for a survey of other approaches. For the purpose of this study, we have adopted Start Time Fair Queuing (SFQ) [8]. SFQ assigns a start and finish tag to each packet using the same formulas mentioned above, but it approximates v(t) by the start time of the packet in service at time t. If no packet is in service, v(t) is reset to zero. Clearly, this definition makes the computation of v(t) much simpler. SFQ has strong bounds on latency, guarantees fairness even under variable rate servers, and works for arbitrary weight assignments to each sharer [8].
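A minimal Python sketch of SFQ tag assignment and scheduling is shown below; the class and method names are ours, and the queue is kept as a simple sorted list rather than the pipelined priority queues used in real schedulers.

```python
class SFQServer:
    """Start-time fair queuing sketch for variable-length packets.

    Tags follow the formulas above: S = max(v, F_prev) and
    F = S + length/phi. The virtual time v(t) is approximated by the
    start tag of the packet currently in service, and is reset to zero
    when the server goes idle.
    """

    def __init__(self, weights):
        self.weights = weights                      # sharer -> phi
        self.last_finish = {s: 0.0 for s in weights}
        self.virtual_time = 0.0
        self.queue = []                             # (start_tag, sharer, length)

    def enqueue(self, sharer, length):
        start = max(self.virtual_time, self.last_finish[sharer])
        self.last_finish[sharer] = start + length / self.weights[sharer]
        self.queue.append((start, sharer, length))

    def dequeue(self):
        """Serve the queued packet with the smallest start tag."""
        if not self.queue:
            self.virtual_time = 0.0                 # idle server: reset v(t)
            return None
        self.queue.sort(key=lambda pkt: pkt[0])
        start, sharer, length = self.queue.pop(0)
        self.virtual_time = start                   # v(t) = start tag in service
        return sharer, length
```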

To apply the fair queuing techniques in the context of a DRAM system, it is important to understand DRAM operation. In the remainder of this section, we explain the internal operation of a DRAM system.

2.2. DRAM System

Modern DRAM is a multi-dimensional structure, with multiple banks, rows and columns. Figure 2 shows the internal structure of a DRAM device. Each bank of memory has an array of memory cells, which are arranged in rows and columns. For accessing data in a DRAM bank, an entire row must be loaded into the row buffer by a row activate command. Once the row is in the row buffer, a column access command (read or write) can be performed by providing the column address. Now if a different row has to be accessed, a precharge command must be issued to write the activated row back into the memory array, before another row activate command can be issued. Multiple DRAM devices are accessed in parallel for a single read/write operation. This collection of devices which are operated upon together is called a rank. Multiple ranks can be placed together in a memory channel, which means that they share the same address and data bus. Each DRAM device manufacturer publishes certain timing constraints that must be respected by consecutive commands to different banks within a rank, and to the same bank. Moreover, resource constraints have to be respected, i.e., before a column access command can be issued, it must be ensured that the data bus will be free for the column access operation. Each memory channel is managed by a memory controller, which ensures that the timing and resource constraints are met. For further details of the inner workings of DRAM devices and controllers, we refer the reader to other studies [31].
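As an illustration of how the bank state determines the work a single access requires, the sketch below expands one transaction into its DRAM commands; the function and argument names are hypothetical, not part of the scheme described in this paper.

```python
def commands_for_access(open_row, target_row):
    """Expand one read/write transaction into DRAM commands (sketch).

    A row buffer hit needs only the column access; a miss with another
    row open needs a precharge and a row activate first; a closed bank
    (open_row is None) skips the precharge.
    """
    if open_row is not None and open_row == target_row:
        return ["column_access"]                            # row buffer hit
    if open_row is None:
        return ["row_activate", "column_access"]            # bank closed
    return ["precharge", "row_activate", "column_access"]   # row conflict
```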

Most recent high-end processors and multicores have their memory controllers integrated on chip [12, 15, 19]. One such memory controller is shown in Figure 3 (the part in the dotted box is our addition and is explained later). When a request arrives at the memory controller, it is placed in the transaction buffer. A channel scheduler selects a memory request to be serviced from the transaction buffer and issues the required DRAM commands to the DRAM device. We refer to a memory reference (read or write) as a memory transaction, while DRAM operations are referred to as commands. For this study, we have assumed that the transaction buffer has a centralized queue instead of per-bank queues. Our scheme can be easily adapted to a transaction buffer with per-bank queues.

Most real memory access schedulers implement a simple First-Ready-First-Come-First-Serve mechanism for selecting DRAM commands [31]. In this mechanism, the access scheduler selects the oldest command that is ready to be issued. Rixner et al. [31] made the observation that this selection algorithm results in suboptimal utilization of DRAM bandwidth. For example, let us assume that the DRAM system has received three memory references, all addressed to the same bank. The first and the third access the same DRAM row, while the second one accesses a different row. It would be bandwidth efficient to schedule the third transaction right after the first one, as that would save us a bank precharge and row activation operation. But this re-ordering would violate the original order in which the transactions were received. Some implications of this re-ordering are discussed in Section 3.2. Rixner et al. [31] studied memory access scheduling in detail and proposed that for efficient memory bandwidth utilization, the memory scheduler should implement the following algorithm: (a) prioritize ready commands over commands that are not ready; (b) prioritize column access commands over other commands; (c) prioritize commands based on their arrival order. We use this scheduling algorithm as our base case for this study and refer to it as Open-Bank-First (OBF) in the rest of this paper.
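The OBF priority order can be expressed as a single comparison key, as in the sketch below; `ready`, `is_column_access` and `arrival_time` are hypothetical fields on a command object, not names used by the paper.

```python
def select_obf_command(candidates):
    """Open-Bank-First selection sketch.

    Ready commands beat commands that are not ready, column accesses
    (row buffer hits) beat other commands, and remaining ties go to the
    oldest command by arrival order.
    """
    if not candidates:
        return None
    return min(candidates,
               key=lambda c: (not c.ready, not c.is_column_access, c.arrival_time))
```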

3. DSFQ Bandwidth Management Scheme

This section describes the implementation details of our scheme. We start by discussing the implementation of fair queuing in DRAM in Section 3.1. Section 3.2 discusses the tradeoff between fairness and performance. Finally, Section 3.3 describes the algorithm for our adaptive bandwidth allocation policy to achieve a targeted DRAM latency for a preferred sharer.

3.1. Fair Queuing for DRAM System

In this section, we explain the modifications made to the DRAM system described in Section 2.2 to implement Start Time Fair Queuing (SFQ). DRAM systems are significantly different from simple queuing systems. Thus there are some fundamental issues to be considered before adapting any queuing discipline.

We explained in Section 2.2 that each memory transaction is divided into DRAM commands (e.g., column access, row activate, etc.) based on the state of the target DRAM bank. For a row buffer hit, only a column access command is required, while a row buffer miss requires a precharge and a row activate besides the column access. Thus the time taken by a DRAM transaction depends on the state of the bank and cannot be determined when the transaction arrives. This creates a problem for calculating finish times using the formula mentioned in Section 2.1. One approach to solve this problem is to delay the calculation of finish times until a command is issued to the DRAM device. This approach was adopted by Nesbit et al. [27]. They noted that by taking this approach, the commands from the same sharer might be assigned finish tags out of order, as memory commands are re-ordered for efficiency. Thus a younger command might receive smaller start/finish tags than an older command. In the worst case, younger commands can keep getting issued, causing starvation or unexpectedly long latencies for some of the older commands. The calculation of start/finish tags is further complicated by DRAM timing constraints, which depend on the exact configuration of the DRAM system.

We propose to solve the above mentioned problems by making the implementation of fair queuing oblivious to the state and timing of the DRAM device. We accomplish that by treating a DRAM transaction as a single entity (instead of being composed of a variable number of commands). Memory bandwidth is allocated among sharers in units of transactions, instead of individual DRAM commands. Each transaction is considered to be a request of unit size and assumed to be serviced in unit time. This approach simplifies the implementation of start-time fair queuing on the DRAM system. Start-time fair queuing defines the current virtual time to be the virtual start time of the packet currently in service. In our case, there are cycles in which none of the commands receives service due to DRAM timing constraints. Thus we take the current virtual time to be the minimum start tag assigned to the currently pending transactions. This choice ensures that the worst case service latency of a newly backlogged sharer is bounded by N−1, where N is the number of sharers¹.



The finish tag of the last transaction of each sharer is kept in a separate register and can be updated at the arrival time of the ith transaction from sharer a as $F^a_i = S^a_i + 1/\phi_a$. Thus if we support N sharers, we need N registers to keep track of the finish tag assigned to the last transaction of each sharer. We assume that the OS provides the reciprocals of weights as integers. Thus the computation of finish tags requires simple addition. We refer to this scheme as DRAM-Start-Time-Fair-Queuing (DSFQ) in the rest of the paper.
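The tag bookkeeping just described amounts to one register per sharer plus a max and an add per arriving transaction. The Python sketch below shows that logic; the class is our illustration of the idea, not the hardware itself, and the pending list stands in for the transaction buffer.

```python
class DSFQTags:
    """DSFQ start/finish tag bookkeeping sketch.

    Every transaction is a unit-size request. The OS supplies the
    reciprocal weights (1/phi) as integers, so the finish tag register
    is updated with a single addition. The current virtual time is the
    minimum start tag among pending transactions (zero when none are
    pending).
    """

    def __init__(self, inv_weights):
        self.inv_weights = inv_weights                  # sharer -> 1/phi (integer)
        self.finish_reg = {s: 0 for s in inv_weights}   # one register per sharer
        self.pending = []                               # (start_tag, transaction)

    def virtual_time(self):
        return min((tag for tag, _ in self.pending), default=0)

    def arrive(self, sharer, transaction):
        """Assign the start tag at arrival and update the finish tag register."""
        start = max(self.virtual_time(), self.finish_reg[sharer])
        self.finish_reg[sharer] = start + self.inv_weights[sharer]
        self.pending.append((start, transaction))

    def min_start_tag_transaction(self):
        """Transaction whose commands the scheduler favors (smallest start tag)."""
        return min(self.pending, key=lambda p: p[0])[1] if self.pending else None
```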

It should be noted here that the treatment of a transaction as a single entity is only for the purposes of start tag assignment. Each transaction is actually translated to multiple commands, with each command prioritized based on the start tag assigned to its corresponding transaction. One might argue that if a sharer has more row buffer conflicts than others, it might end up sending more commands to DRAM (as it needs a precharge and a row activate besides the column access). We assert that our scheme enforces memory-reference-level bandwidth shares. But it can be easily modified to keep track of command bandwidth, by updating the finish tag registers whenever a command other than a column access is issued. Such a modification would still assign start tags to transactions at their arrival and thus would possess all the fairness properties of our scheme. We have experimentally tested this modification and found that it shows the same results as our original scheme.

3.2. Tradeoff of Fairness and Performance

In the last section, we provided a methodology to allocate DRAM bandwidth to sharers based on their weights. If the commands are scheduled strictly based on their start tags, the DRAM bandwidth would not be utilized efficiently. In order to utilize DRAM efficiently and support the shares of each sharer, we use the following algorithm for the DRAM command scheduler: (a) prioritize ready commands over commands that are not ready; (b) prioritize column access commands over other commands; (c) prioritize commands based on their start tags. Note that this algorithm is similar to the one proposed by Nesbit et al. [27]. The only difference is that the requests are ordered based on their start tags, instead of the finish tags. There is one potential problem with the above scheduling algorithm. Consider two DRAM transactions for the same bank, one being a row hit with a larger start tag and the other a row miss with a smaller start tag. The above algorithm would serve the transaction with the larger start tag, as it prioritizes column access commands. If there is a chain of transactions with row buffer hits but with larger start tags, they can cause unlimited delays for transactions with smaller start tags. Nesbit et al. [27] made a similar observation, and proposed that after a bank has been open for tRAS time (a timing constraint of a DRAM device – the time between a row access and the next precharge), it should wait for the command with the oldest tag to become ready.

¹ The worst case occurs when every sharer has a pending transaction with the same start tag as the transaction with the smallest start tag.

We propose a more general and flexible solution to this problem. Whenever a DRAM command is issued, if it does not belong to the transaction with the minimum start tag, we increment a counter. If the counter reaches a specified threshold, the memory scheduler stops scheduling commands from transactions other than the one with the minimum start tag and waits for those commands to become ready. Once a command from the transaction with the smallest start tag is issued, the threshold counter is reset. This threshold value is called the starvation prevention threshold (SPT). The scheduling policy mentioned above without such a threshold violates the latency bounds of SFQ. However, the introduction of SPT limits the increase in the latency bound to a constant additive factor because at most SPT requests can be served before falling back to the SFQ ordering.
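A sketch of the SPT mechanism is given below; `threshold` corresponds to the SPT value, and the `transaction` attribute linking a command back to its memory transaction is a hypothetical field of our own.

```python
class StarvationPrevention:
    """Starvation prevention threshold (SPT) counter sketch."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counter = 0

    def allowed_commands(self, ready_commands, min_tag_txn):
        """While the counter is below SPT the scheduler may pick any ready
        command; once it reaches SPT, only commands of the transaction with
        the minimum start tag may issue (the scheduler waits for them to
        become ready)."""
        if self.counter >= self.threshold:
            return [c for c in ready_commands if c.transaction is min_tag_txn]
        return ready_commands

    def record_issue(self, command, min_tag_txn):
        """Reset the counter when the minimum-start-tag transaction is served;
        otherwise count the bypass toward the threshold."""
        if command.transaction is min_tag_txn:
            self.counter = 0
        else:
            self.counter += 1
```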

The optimal SPT value is environment- and workload-dependent. For example, if the processor is used in a situation where providing performance isolation is the key concern, a smaller threshold should be used. On the other hand, if efficiently utilizing the DRAM bandwidth is the primary goal, a higher threshold is desired. It should be noted that efficient DRAM bandwidth utilization is not always correlated with better overall performance. In some cases, stricter enforcement of SFQ ordering between transactions (i.e., smaller thresholds) results in better performance. This is because the advantage of improved DRAM bandwidth utilization due to larger thresholds is overshadowed by the increase in latency observed by starved transactions. Thus we propose to make this threshold adjustable by the operating system. Our results in Section 5.1 show that the optimal value of the threshold varies from one benchmark mix to another and that extreme (very high or very small) values of the threshold hurt performance.

3.3. Adaptive Policy

The introduction of fair queuing allows us to control the relative share of each sharer in the DRAM bandwidth. But it does not provide any control over the actual latencies observed by memory references in the DRAM. The latency observed by a memory reference in DRAM depends on a lot of factors, including the length of the sharer's own queue, its bank hit rate, the bank hit rate of other sharers, etc. Moreover, as the requests from different sharers are interleaved, their bank hit/miss rates would be different, as some misses might be caused by interference with other sharers. Thus we cannot guarantee specific performance levels for any sharers, unless we pessimistically allocate shares for them (provisioning for worst case scenarios). Such worst case provisioning is not suitable for general purpose computing platforms [27].

We propose to solve the above mentioned problem by implementing a feedback based adaptive policy in the operating system or hypervisor. For this study, we restrict our policy to achieving a target latency for only one of the sharers (extension of the policy to support target latencies for multiple sharers is part of future work; see Karlsson et al. [17] for a discussion of such an algorithm). The adaptive policy tweaks the weight assigned to the target sharer (based on the feedback) to get the desired latency.



The feedback can be obtained from performance counters provided by processors. For this study, we assumed that performance counters for the "average latency observed by requests from a processor" are available. It is expected that such counters will be made available since memory controllers are now on chip. If such counters are not available, a feedback policy can be implemented to achieve a target instructions per cycle (IPC).

(* share_{i,t} – share of sharer i during interval t *)
(* latency_{i,t} – average latency of sharer i during interval t *)
(* slope_{i,t} – rate of change of delay with change in share for sharer i during interval t *)
(* α – forgetting factor *)

Initialization:
1. for all i: share_{i,0} := 100 / n

At the end of each epoch t:
1. (* Calculate the slope for the prioritized sharer p *)
   slope_{p,t} := (latency_{p,t} − latency_{p,t−1}) / (share_{p,t} − share_{p,t−1})
2. if slope_{p,t} > 0 then: slope_{p,t} := slope_{p,t−1}
3. (* Calculate the share for the next interval *)
   share_{p,t+1} := share_{p,t} + (goal_p − latency_{p,t}) / slope_{p,t}
4. (* Calculate the EWMA of the calculated share *)
   share_{p,t+1} := (α ∗ share_{p,t+1}) + ((1 − α) ∗ share_{p,t})
5. (* Divide the remaining share equally among the rest of the sharers *)
   for all i ≠ p: share_{i,t+1} := (100 − share_{p,t+1}) / (n − 1)

Figure 4. Adaptive Policy Algorithm

The algorithm used by our feedback based latency control policy is described in Figure 4. It starts out by assigning equal weights to all sharers. At the end of each epoch, it calculates "the rate of change of average latency with the change in share", called the slope, during the last interval. Since the latency should decrease with an increase in share, the slope should be negative. But under certain conditions, the slope might come out to be positive. For example, consider an epoch in which the share of a sharer was increased, but during the same epoch some new sharers became heavily backlogged. In such situations, the slope might come out to be positive. In our algorithm, we discard the positive slope and use the previously calculated slope instead. Based on the value of the slope, we calculate the amount of change in share required to achieve the desired latency. In order to make sure that transient conditions in an epoch do not cause the algorithm's response (share) to change drastically, we keep the past history of shares by calculating an exponentially weighted moving average (EWMA) of the shares [10]. We use a smoothing factor α, which controls the weightage given to the past history.

System: 1 chip, 8 processors per chip
Processor technology: 65 nm, 0.5 ns cycle time
Processor configuration: 4-wide, 64-entry instruction window and 128 ROB entries, 1KB YAGS branch predictor
L1 I/D cache: 128KB, 4-way, 2 cycles
L2 cache: 4MB, 16-way, shared, 8-way banked, 9 cycles
DRAM configuration: 1GB total size, 1 channel, 2 ranks, 8 banks, 32 transaction buffer entries, open page policy
Disk latency: fixed 10ms

Table 1. Simulated configuration

The value of α should be chosen in such a way that the algorithm is quick enough to respond to changing conditions, but keeps enough history to have a stable response. Our experimental results show that a value of 80 for α provides a good compromise between stability and responsiveness. After we have calculated the share of the target sharer, we divide the rest of the bandwidth equally among the rest of the sharers.
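A runnable Python version of one epoch of Figure 4 is sketched below. It assumes α is expressed as a fraction (0.8 corresponding to the value of 80 used above), that the previous slope passed in is negative, and it guards against a zero denominator, which the figure leaves implicit.

```python
def adaptive_epoch(shares, latencies, prev_share_p, prev_latency_p, prev_slope,
                   goal, p, alpha=0.8):
    """One epoch of the adaptive policy (sketch of Figure 4).

    shares/latencies: per-sharer values measured in the epoch that just
    ended; p is the prioritized sharer and goal its target latency.
    Returns the shares for the next epoch and the slope to carry forward.
    """
    n = len(shares)
    # Steps 1-2: slope of latency vs. share; fall back to the previous
    # (negative) slope if the measurement is non-negative or undefined.
    d_share = shares[p] - prev_share_p
    slope = (latencies[p] - prev_latency_p) / d_share if d_share else prev_slope
    if slope >= 0:
        slope = prev_slope
    # Step 3: share change needed to close the latency gap.
    new_share = shares[p] + (goal - latencies[p]) / slope
    # Step 4: EWMA to damp transient swings.
    new_share = alpha * new_share + (1 - alpha) * shares[p]
    # Step 5: split the remaining bandwidth equally among the other sharers.
    next_shares = {i: (100 - new_share) / (n - 1) for i in shares if i != p}
    next_shares[p] = new_share
    return next_shares, slope
```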

Dynamically adjusting weights for the adaptive policy introduces a problem: a sharer which was assigned a small weight in the previous epoch would have its finish tag register and the start tags of its pending transactions set to high values. That would mean that the sharer would be de-prioritized (at least for some time) during the next epoch too, even if its weight is increased. Incorrect start tags of pending transactions are not a significant problem for DRAM, as the DRAM transaction queue size is bounded (32 for this study); the time required for serving the pending transactions would be a very small fraction of the total epoch time (see Zhang et al. [38] for similar arguments). For the finish tag registers, we propose a simple solution: whenever the weights of sharers are updated, the finish tag register for each sharer is reset to the maximum of all finish tags. Thus all new memory transactions would get start tags based on their new weight assignment.
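Using the DSFQTags sketch from Section 3.1, the reset is a few lines; this is our illustration of the rule just described, not the paper's hardware interface.

```python
def apply_new_weights(tags, new_inv_weights):
    """Reset every finish tag register to the maximum finish tag and install
    the new reciprocal weights, so that transactions arriving under the new
    assignment start from a common reference point."""
    max_tag = max(tags.finish_reg.values())
    for sharer in tags.finish_reg:
        tags.finish_reg[sharer] = max_tag
    tags.inv_weights = dict(new_inv_weights)
```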

For the implementation of this adaptive policy, we selected an epoch length of 2 million cycles. We found the execution time of the algorithm to be 15 microseconds on average; thus the timing overhead of the adaptive scheme is only 1.5%. Our results in Section 5.2 show that the adaptive policy successfully achieves the target latencies.
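For reference, the 1.5% figure follows directly from the 0.5 ns cycle time in Table 1:

$$2 \times 10^{6}\ \text{cycles} \times 0.5\ \text{ns/cycle} = 1\ \text{ms per epoch}, \qquad \frac{15\ \mu\text{s}}{1\ \text{ms}} = 1.5\%.$$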

Summary: We have proposed a scheme which enforces OS-specified shares, uses the DRAM bandwidth efficiently, avoids starvation and is flexible. We have also suggested a policy for adjusting weights to achieve a target memory latency for a prioritized sharer. Next, we describe the experimental methodology used to evaluate our scheme.

4. Methodology

We used the Simics [21]-based GEMS [22] full system simulation platform for our simulations.



Workload | Benchmarks | Benign
Mix1 | ammp, mcf, lucas, parser, vpr, twolf | applu
Mix2 | facerec, equake, swim, gcc, mgrid, gap | apsi
Mix3 | equake, mcf, ammp, swim, lucas, applu | vpr
Mix4 | vpr, facerec, gcc, mgrid, parser, gap | fma3d
Mix5 | parser, gap, apsi, fma3d, perlbmk, crafty | art

Table 2. Workloads

The simulated system runs an unmodified Solaris operating system, version 5.9, and simulates an 8-core CMP. Each core is modeled by the OPAL module in GEMS. OPAL simulates an out-of-order processor model, implements a partial SPARC v9 ISA, and is modeled after the MIPS R10000. For our memory subsystem simulation, we use the RUBY module of GEMS, which is configured to simulate a MOESI directory-based cache coherence protocol. We use a 4-way 128KB L1 cache. The L2 cache is 8-way banked, with a point-to-point interconnect between the L2 banks and the L1. For the L2 cache, we used a modified cache replacement scheme to support fairness as suggested by cache fairness studies [18, 29, 34]. This replacement scheme enforces per-processor shares in the L2 cache (we set equal shares for all processors). This scheme ensures that the shared cache is used efficiently while still providing performance isolation between sharers. We used CACTI-3.0 [32] to model the latency of our caches and the DRAMsim memory system simulator [35] to model the DRAM system. We modified DRAMsim to implement the transaction scheduling policies studied in this paper. DRAMsim was configured to simulate 667 MHz DDR2 DRAM devices, with timing parameters obtained from the Micron DRAM data-sheets [24]. We used an open page policy for row buffer management (row buffer kept open after a column access), instead of a closed page policy (row buffer closed after each column access), as our base case command scheduling policy exploits row buffer hits. Table 1 lists the key parameters of the simulated machine.

We used five multiprogrammed workloads for this study, as shown in Table 2. These workloads include the most memory intensive applications from the SPEC 2000 suite as mentioned by [39]. In order to test the effectiveness of our performance isolation policy, we use a kernel that acts as a DRAM bandwidth hog. The hog creates a matrix bigger than the cache space and accesses its elements in such a way that each of its accesses is a miss. For the experiments with the hog, it replaces the benchmark listed as "Benign" (Table 2) in the mixes. For our simulations, caches are warmed up for 1 billion instructions and the workloads are simulated in detail for 100 million instructions. In order to isolate the results of our benchmarks from OS effects, we reserve one (out of eight) cores for the OS and run benchmarks on the remaining 7 cores. We simulate the adaptive policy inside the simulator without modifying the operating system.

5. Results

The two primary results of our experiments are:

1. Our fair queuing mechanism offers better performance isolation compared to FQ-VTVF proposed by Nesbit et al. [27]. The improvements are visible in both fairness (improvement in maximum sharer service latency) and performance (21% improvement in IPC).

Figure 5. Worst-case Sharer Service Latencies (maximum sharer service latency, in cycles, for OBF, DSFQ-1/3/5/10/20 and FQ-VTVF on Mix1–Mix5 and on average)

Figure 6. Average Sharer Service Latencies (average sharer service latency, in cycles, for the same policies and mixes)

2. The feedback-driven adaptive policy achieves latencies within 12% of the target. The policy shows behavior which is stable and responsive with a smoothing factor of 80.

We elaborate on these two results in the remainder of this section.

5.1. Performance Isolation

In this section, we will present results for experiments performed to study the performance isolation or fairness capabilities of different transaction scheduling policies. In these experiments, the hog is included as a sharer and the share (φ) of each sharer is set to be equal. In order to analyze the fairness and/or starvation behavior of the different transaction scheduling policies, we measured sharer service latencies for each DRAM command. The sharer service latency for the ith command from sharer a is calculated as follows:

$$sharer\_service\_latency^{a}_{i} = t - \max(arrival^{a}_{i},\ last\_service^{a})$$

where t is the time that the command is issued, $arrival^{a}_{i}$ is the arrival time of the command, and $last\_service^{a}$ is the time when the last command from the same sharer was issued.

Figure 7. Normalized IPC

The strength of a fair queuing algorithm is often represented using the worst case latencies observed by sharers under that policy [3], as it highlights the possibility of starvation. We adopt the same approach and present the maximum (worst case) sharer service latencies of the different transaction scheduling policies for all benchmarks in Figure 5. It can be observed that the OBF policy has extremely high sharer service latencies. In Section 3.2, we introduced SPT as a measure to avoid starvation and expect small values of SPT to avoid starvation. The results in Figure 5 clearly show that behavior. On average, the worst case service latencies increase with an increase in the SPT value. One can notice that for some mixes (Mix1, Mix3 and Mix5), the relationship between worst case service latency and SPT is not monotonic, i.e., a higher SPT has a smaller worst case latency. The worst case latency depends on the dynamics of the system, and the worst case is not guaranteed to always happen. It can be observed that FQ-VTVF has high worst case sharer service latencies, in some cases even higher than OBF and 15 times higher than DSFQ3 on average. There are multiple reasons behind this behavior (they were mentioned earlier in Section 2.1 and in Section 3.1): first, FQ-VTVF assigns finish tags to the commands of a sharer out of order, and second, a bursty sharer can cause longer latencies for other sharers once it becomes backlogged.

It should be noted here that the worst case behavior is not representative of performance, as it might occur only rarely. For performance, average or common case behavior is considered more relevant. Thus we shift our focus to average sharer service latencies. Figure 6 plots the average sharer service latencies for the different transaction scheduling mechanisms. OBF results in the worst latency of all mechanisms (68.7 cycles on average), while DSFQ3 shows the best latency (36.8 cycles). It should be noted here that the average sharer latency of DSFQ1 is slightly higher than that of DSFQ3. The reason is that DSFQ1 strictly enforces DRAM command ordering, and does not utilize the DRAM system very efficiently. FQ-VTVF has an average sharer service latency of 44.4 cycles, which is 17.1% worse than DSFQ3. It should also be noted that FQ-VTVF has better sharer service latencies on average than DSFQ20 (which was not true for the worst case sharer service latencies). This shows that FQ-VTVF has better average case behavior than worst case behavior.

Figure 7 shows the normalized IPC of all benchmarks with the different transaction scheduling mechanisms. It can be observed that the performance trends are similar to the average sharer service latency. DSFQ3 gets the best performance on average, followed by DSFQ1 and DSFQ5. DSFQ3 improves performance by 158% over OBF and by 21.7% over FQ-VTVF. FQ-VTVF shows better performance than DSFQ20 and improves performance by 102% over OBF. It should be noted that the best performing policy is different for different mixes. For Mix1 and Mix3, DSFQ5 performs the best. In the case of Mix2 and Mix4, DSFQ3 has the best performance, while DSFQ1 works best for Mix5. That confirms the observations we made in Section 3.2. In some cases, serving requests in the exact order dictated by start tags can hurt performance, as it would increase the number of bank conflicts (as shown by Mix1, Mix2, Mix3 and Mix4 for DSFQ1). In other cases, prioritizing ready commands and row buffer hits can hurt performance as the latency of other requests increases, as can be seen in Mix5. In general, we observed that SPT values in the 1 to 5 range are ideal choices if SPT is to be statically chosen. One may also consider a feedback-based adaptive approach (similar to the one described in Section 3.3) for automatically tuning the SPT for any application mix. We do not study adaptive SPT tuning in this paper.

5.2. Adaptive Policy

In this section, we present results for the adaptive policy explained in Section 3.3. For these experiments, we replace the hog with the benchmarks listed as Benign in Table 2. We restrict ourselves to Mix1, Mix2 and Mix3 for these experiments, as the other two mixes do not have a significant level of contention in memory. The target latency is set to 600 cycles for these experiments. We selected one benchmark from each mix (lucas, swim and ammp for Mix1, Mix2 and Mix3 respectively) as the prioritized application. In our simulations without the adaptive policy, we noticed that lucas, swim and ammp had average memory latencies of 1267, 504 and 1273 cycles in their respective mixes. Thus the target latency is lower than the observed latencies for lucas and ammp and higher for swim. We made this choice to study the capability of the adaptive algorithm for both kinds of targets. A target higher than the actual latency can be useful if the targeted sharer can be de-prioritized to improve the latencies of other contenders. We chose DSFQ5 as the underlying transaction scheduling mechanism for these experiments. We performed our experiments with different values of the smoothing factor (α in Figure 4) and found that a smoothing factor of 80 provides a good compromise between agility and stability of the algorithm.

Figure 8 plots the memory latencies achieved by the adaptive policy for the three selected benchmarks.



Figure 8. Latency Achieved by Adaptive Scheme: total memory latency (cycles) versus epochs (2 million cycles each) for (a) Mix1, (b) Mix2 and (c) Mix3.

It can be observed that the algorithm reaches the target latencies in 3 epochs for all three mixes. For Mix2 and Mix3, there is a small overshoot. In the case of Mix1, the latency starts to increase after reaching the target (due to an increased number of DRAM requests), but the algorithm detects that change and adjusts the share accordingly. The algorithm sets the shares of lucas, swim and ammp to 35%, 9% and 37% respectively in the steady state (share numbers are not shown in the interest of space). The algorithm achieves average latencies within 12%, 2.5% and 9% of the target latency for the three benchmarks. That results in an IPC improvement of 64% and 38.5% for lucas and ammp respectively, and a slight reduction in the IPC of swim with a small increase in the performance of the other benchmarks in Mix2.

The above results show that the adaptive algorithm is stable, responsive and achieves the target latencies. It can also be observed that such an adaptive algorithm can be used for effective prioritization, as it improves the IPC by up to 64% for one of the benchmarks. We also performed some experiments in which the latency target was too low to be achieved. In such cases, the adaptive policy settled at the achievable latency closest to the target.

6. Related Work

Memory bandwidth management for chip multiprocessors has recently attracted the attention of researchers. Nesbit et al. [27] proposed a scheme for providing QoS in the memory controller of a CMP. They implemented fair queuing, but they used real time, instead of virtual time, in their calculation of finish tags. They base this choice on the argument that a thread which has not been backlogged in the past should be given a higher share if it becomes backlogged. We noticed that if such a thread has bursty behavior, it can cause starvation for other threads (as shown in our results in Section 5.1). Hence we make the case that if a thread does not use excess bandwidth at the cost of other threads, it should not be penalized. This notion has been widely accepted in the networking community [8]. Moreover, in their scheme, a finish tag is assigned to each command and is calculated using the actual time taken by the command (which is not known until a command is issued). Thus in their scheme, memory references are not assigned finish tags in their arrival order. We make the observation that this can cause starvation and unexpectedly long latencies for some memory references (see Section 3.1 for further details). Nesbit et al. suggest that their scheme provides QoS but concede that it cannot guarantee any limits on the overall latency observed by memory references. Our scheme not only solves the above mentioned problems, but also simplifies the implementation of the hardware mechanism.

Iyer et al. [14] used a simple scheme for DRAM bandwidth allocation, and referred to the work by Nesbit et al. [27] for a more sophisticated DRAM bandwidth management mechanism. Moscibroda et al. [25] studied the fairness issues in DRAM access scheduling. They defined the fairness goal to be "equalizing the relative slowdown when threads are co-scheduled, relative to when the threads are run by themselves". Thus their scheme has additional overhead for keeping track of the row buffer locality of individual threads. Natarajan et al. [26] studied the impact of memory controller configuration on the latency-bandwidth curve. Zhu et al. [39] proposed memory system optimizations for SMT processors. There have been many proposals discussing memory access ordering [11, 23, 30, 31]. We have used an aggressive memory access ordering scheme as our base case in this study.

Some memory controller proposals provide methods of supporting hard real time guarantees on memory latencies [9, 20, 33]. These schedulers either rely on knowing the memory access pattern, and are thus not suitable for general purpose systems, or compromise efficiency for providing QoS. Our scheme is flexible as it lets the OS choose the right balance between efficiency and QoS. Moreover, it is based on feedback control and does not make any assumptions about the arrival pattern of memory requests. Feedback based control of weights in fair queuing has been studied in other contexts [16, 38]. Our adaptive scheme is designed using the observations made by these studies. (The design of a basic feedback-based control algorithm is not claimed as our contribution.)

7. Conclusion

In this paper, we proposed a scheme for managing the DRAM bandwidth. Our scheme is simple, fair and efficient.



Simplicity is achieved by treating each memory request as a unit of scheduling (instead of as a composition of multiple DRAM commands). The simplicity of our scheme enables fairness by allowing us to use start time fair queuing with proven bounds on worst case latency. The scheme utilizes DRAM efficiently as it schedules commands to maximize row buffer hits, unless such a scheduling order starts conflicting with the fairness goals. Enforcing DRAM bandwidth shares is not adequate to achieve predictable memory access latencies. Thus we introduced an adaptive algorithm which adjusts the weight assignment of sharers to achieve desired latencies.

Our results show that our scheme provides very good average and worst case sharer service latencies. It shows a performance improvement of 21% on average over FQ-VTVF. Our adaptive policy achieves latencies within 12% of the target, and results in performance improvements of up to 64%. In the future, we plan to explore more sophisticated adaptive policies which try to achieve target latencies for multiple sharers.

Acknowledgments

The authors would like to thank Asad Khan Awan for discussions on network fair queuing. We would also like to thank the anonymous reviewers for their feedback. This work is supported in part by the National Science Foundation (Grant No. CCF-0644183), Purdue Research Foundation XR Grant No. 6901285-4010, and Purdue University.

References

[1] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 18, 2006.

[2] S. I. Association. ITRS Roadmap, 2005.

[3] J. C. R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair queueing. In INFOCOM (1), pages 120–128, 1996.

[4] D. Burger, J. R. Goodman, and A. Kagi. Memory bandwidth limitations of future microprocessors. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 78–89, New York, NY, USA, 1996. ACM Press.

[5] A. Demers, S. Keshav, and S. Shenker. Analysis and Simulation of a Fair Queueing Algorithm. In SIGCOMM '89: Symposium Proceedings on Communications Architectures & Protocols, pages 1–12, New York, NY, USA, 1989. ACM Press.

[6] E. M. Gafni and D. P. Bertsekas. Dynamic Control of Session Input Rates in Communication Networks. IEEE Transactions on Automatic Control, 29(11):1009–1016, November 1984.

[7] S. J. Golestani. A Self-Clocked Fair Queueing Scheme for Broadband Applications. In INFOCOM, pages 636–646, 1994.

[8] P. Goyal, H. M. Vin, and H. Cheng. Start-time fair queueing: a scheduling algorithm for integrated services packet switching networks. IEEE/ACM Trans. Netw., 5(5):690–704, 1997.

[9] H. P. Hofstee, R. Nair, and J.-D. Wellman. Token based DMA. United States Patent 6820142, November 2004.

[10] J. S. Hunter. The Exponentially Weighted Moving Average. Journal of Quality Technology, 18(4):203–210, 1986.

[11] I. Hur and C. Lin. Adaptive History-Based Memory Schedulers. In Proceedings of the 37th Annual International Symposium on Microarchitecture, pages 343–354, 2004.

[12] S. M. Inc. UltraSPARC IV processor architecture overview, February 2004.

[13] A. Ioannou and M. Katevenis. Pipelined Heap (Priority Queue) Management for Advanced Scheduling in High Speed Networks. IEEE/ACM Transactions on Networking, 2007.

[14] R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. SIGMETRICS Perform. Eval. Rev., 35(1):25–36, 2007.

[15] R. N. Kalla, B. Sinharoy, and J. M. Tendler. IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro, 24(2):40–47, 2004.

[16] M. Karlsson, C. Karamanolis, and J. Chase. Controllable fair queuing for meeting performance goals. Performance Evaluation, 62(1-4):278–294, 2005.

[17] M. Karlsson, X. Zhu, and C. Karamanolis. An adaptive optimal controller for non-intrusive performance differentiation in computing services. In International Conference on Control and Automation (ICCA), pages 709–714, 2005.

[18] S. Kim, D. Chandra, and Y. Solihin. Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. In IEEE PACT, pages 111–122, 2004.

[19] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-Way Multithreaded Sparc Processor. IEEE Micro, 25(2):21–29, 2005.

[20] K.-B. Lee, T.-C. Lin, and C.-W. Jen. An efficient quality-aware memory controller for multimedia platform SoC. IEEE Transactions on Circuits and Systems for Video Technology, 15(5):620–633, May 2005.

[21] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50–58, February 2002.

[22] M. M. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. SIGARCH Computer Architecture News, 2005.

[23] S. A. McKee, A. Aluwihare, B. H. Clark, R. H. Klenke, T. C. Landon, C. W. Oliver, M. H. Salinas, A. E. Szymkowiak, K. L. Wright, W. A. Wulf, and J. H. Aylor. Design and Evaluation of Dynamic Access Ordering Hardware. In International Conference on Supercomputing, pages 125–132, 1996.

[24] Micron. DDR2 SDRAM Datasheet.

[25] T. Moscibroda and O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems. Technical Report MSR-TR-2007-15, Microsoft Research, February 2007.

[26] C. Natarajan, B. Christenson, and F. Briggs. A study of performance impact of memory controller features in multi-processor server environment. In WMPI '04: Proceedings of the 3rd Workshop on Memory Performance Issues, pages 80–87, New York, NY, USA, 2004. ACM Press.

[27] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith. Fair Queuing Memory Systems. In MICRO '06: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 208–222, Washington, DC, USA, 2006. IEEE Computer Society.

[28] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: the single-node case. IEEE/ACM Trans. Netw., 1(3):344–357, 1993.

[29] N. Rafique, W.-T. Lim, and M. Thottethodi. Architectural Support for Operating System-Driven CMP Cache Management. In The Fifteenth International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2006.

[30] S. Rixner. Memory Controller Optimizations for Web Servers. In Proceedings of the 37th Annual International Symposium on Microarchitecture, pages 355–366, 2004.

[31] S. Rixner, W. J. Dally, U. J. Kapasi, P. R. Mattson, and J. D. Owens. Memory access scheduling. In International Symposium on Computer Architecture, pages 128–138, 2000.

[32] P. Shivakumar and N. P. Jouppi. CACTI 3.0: An Integrated Cache Timing, Power and Area Model. Technical report, Western Research Lab, Compaq, 2001.

[33] Sonics. Sonics MemMax 2.0 Multi-threaded DRAM Access Scheduler. Datasheet, June 2006.

[34] G. E. Suh, L. Rudolph, and S. Devadas. Dynamic Partitioning of Shared Cache Memory. The Journal of Supercomputing, 28(1):7–26, 2004.

[35] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob. DRAMsim: A memory-system simulator. SIGARCH Computer Architecture News, 33(4):100–107, September 2005.

[36] H. Zhang. Service disciplines for guaranteed performance service in packet-switching networks, 1995.

[37] L. Zhang. Virtual clock: a new traffic control algorithm for packet switching networks. In SIGCOMM '90: Proceedings of the ACM Symposium on Communications Architectures & Protocols, pages 19–29, New York, NY, USA, 1990. ACM Press.

[38] R. Zhang, S. Parekh, Y. Diao, M. Surendra, T. F. Abdelzaher, and J. A. Stankovic. Control of fair queueing: modeling, implementation, and experiences. In 9th IFIP/IEEE International Symposium on Integrated Network Management, pages 177–190, May 2005.

[39] Z. Zhu and Z. Zhang. A Performance Comparison of DRAM Memory System Optimizations for SMT Processors. In Proceedings of the 11th International Symposium on High Performance Computer Architecture, pages 213–224, 2005.
