Copyright
by
Rajesh Ganesan
2017
The Thesis Committee for Rajesh Ganesan
Certifies that this is the approved version of the following thesis:
Reducing Cache Misses due to Frequent Context
Switching using a Cache Context Store
APPROVED BY
SUPERVISING COMMITTEE:
Jacob Abraham, Supervisor
Mark McDermott
Reducing Cache Misses due to Frequent Context
Switching using a Cache Context Store
by
Rajesh Ganesan, B.E.
Thesis
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science in Engineering
The University of Texas at Austin
May 2017
Abstract
Reducing Cache Misses due to Frequent Context
Switching using a Cache Context Store
Rajesh Ganesan, M.S.E.
The University of Texas at Austin, 2017
Supervisor: Jacob Abraham
Computer system performance has been pushed further and further for
decades, and hence the complexity of the designs has been increasing as well.
This is true for both hardware and software. Problems at the interface of hard-
ware and software are particularly interesting, as are solutions which include
interaction between hardware and software. Cache misses due to frequent con-
text switching is one such problem and the solution proposed in this thesis is
to save the cache context of processes in a memory called a Cache Context
Store (CCS). The CCS is designed as a byte-addressable memory close to the
processor capable of holding multiple cache contexts. Speedup achievable by
such a system is calculated analytically using experimental data from running
SPEC CPU2006 benchmarks on a real system.
Table of Contents
Abstract
List of Figures
Chapter 1. Introduction
Chapter 2. Context Switching
2.1 Process
2.2 Context Switch
2.3 Performance Penalty due to Context Switching
2.3.1 Direct Penalty
2.3.2 Indirect Penalty
Chapter 3. Related Work
3.1 Reducing the Performance Penalty due to Context Switching
3.1.1 CLOCS, 1990
3.1.2 ETS, 2014
3.2 Changing the Traditional Memory Hierarchy
3.2.1 CAMEO, 2014
3.2.2 ThyNVM, 2015
3.2.3 N3XT, 2015
Chapter 4. Emerging Memory Technologies
4.1 Spin Transfer Torque RAM
4.2 Resistive RAM
4.3 Phase Change Memory
4.4 eDRAM
4.5 Design Space Exploration using DESTINY
Chapter 5. Architecture Description & Evaluation
5.1 Cache Context Store
5.2 Design Decisions for Real Systems
5.3 Experiment Setup & Results
5.4 Speedup Relative to the Worst Case Performance
Chapter 6. Limitations & Future Improvements
6.1 Limitations of the Proposed Design
6.2 Feasibility for Deployment of this Solution in a Real System
6.3 Possible Future Improvements
Bibliography
List of Figures
2.1 Address Space of a Process
2.2 Timer Interrupt
3.1 Traditional Memory Hierarchy Vs. CLOCS Memory Hierarchy
3.2 Impact of CS misses on different workloads
4.1 STT-RAM Storage Cell
4.2 ReRAM Storage Cell
4.3 PCRAM Storage Cell
4.4 Comparison of Emerging Memory Technologies
4.5 An Overview of the DESTINY Framework
4.6 Write Latency
4.7 Write Latency for PCRAM
4.8 Read Latency
4.9 Read Bandwidth
4.10 Write Bandwidth
4.11 Read Dynamic Energy
4.12 Write Dynamic Energy
4.13 Write Dynamic Energy for PCRAM
4.14 Leakage Power
5.1 Fields in a Physical Address
5.2 A Generic m-way Set Associative Cache
5.3 CCS Address Look-up Table
5.4 Cache Context Store Update Mechanism
5.5 Variation of Cache Misses, Context Switches and Runtime
5.6 Speedup for a 64 MB 3D eDRAM Cache Context Store
Chapter 1
Introduction
The objective of this thesis is to develop techniques for improving the
performance of a computer system by identifying non-obvious recurring events
that cause a degradation in maximum achievable performance. It proposes a
potential solution that can eliminate or at least mitigate this degradation.
Since these events are recurring, even a simple solution could have an impact
larger than would be expected.
To be more specific, the recurring event mentioned above is a Context
Switch, and the degradation is due to the CPU stalling cycles because of cache
misses due to the context switching. The proposed solution is to eliminate
these cache misses using a Cache Context Store (CCS) that can save the cache
contexts of multiple processes. The Cache Context Store is proposed as a
tightly integrated byte-addressable memory entirely managed by the hardware
and invisible to the software. A design exploration study is performed to find
a good candidate for the type of memory technology to be used for the Cache
Context Store.
The rest of this thesis is arranged as follows. In Chapter 2, basic princi-
ples of context switching are explained. In Chapter 3, prior work on handling
cache misses due to context switching is described and a few published hybrid
memory systems are surveyed to establish familiarity with such emerging mem-
ory hierarchies. In Chapter 4, potential memory technologies for the proposed
Cache Context Store are surveyed. In Chapter 5, the proposed architecture is
described and results are analyzed. In Chapter 6, limitations of this design are
mentioned and potential future design choices to mitigate them are suggested.
Chapter 2
Context Switching
An Operating System (OS) is the software that manages a computer
system and allows programs to judiciously make use of the available hardware
resources such as the CPU, memory, I/O, etc. It primarily provides virtualization,
concurrency and persistence [1]. Virtualization provides the illusion to
each running program that it has all the hardware resources available in the
system for itself. Concurrency is the ability to allow multiple applications to
seemingly run at the same time. Persistence is necessary to save volatile main
memory contents to a non-volatile memory such as a hard disk or an SSD.
This is necessary so that programs which need more memory than the available
physical memory can still produce correct results, and so that the system
state is consistent across machine reboots.
2.1 Process
A process is a running instance of a program. It has its own address
space. This is shown in Figure 2.1 taken from Operating Systems: Three Easy
Pieces [1]. At any time, there are multiple processes ready to run if given
the chance. All processes time share the available hardware resources and the
OS decides how to schedule them. The context of a process consists of the
program counter, general purpose registers, floating point registers, and its
process ID. This context of a process resides in the Operating System kernel
memory in a data structure called Process Control Block (PCB). When a
process A is suspended and another process B starts (or resumes) executing,
this is called a context switch. The PCB of the suspended process A is saved
from corresponding registers to memory, and the PCB of the resuming process
B is copied from memory to the corresponding registers. The OS scheduling
can be either co-operative or pre-emptive. In co-operative scheduling, a process
yields to others. In pre-emptive scheduling, there is a time slice allocated for
each process and once the time slice expires, the OS schedules another process
for the next time slice. Mac OS 9 was the last major OS to use co-operative
scheduling; all modern OSs, including Mac OS X, now use pre-emptive
scheduling.
Figure 2.1: Address Space of a Process
2.2 Context Switch
As shown in Figure 2.2 taken from [1], during a timer interrupt that
causes a context switch, the following happens.
1. Process A’s CPU register state is saved to its PCB in kernel memory.
2. The scheduler chooses a Process B from the list of available processes.
3. Process B’s PCB is copied to the CPU registers from kernel memory.
Figure 2.2: Timer Interrupt
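To make the state involved in these three steps concrete, the sketch below models a PCB and the save/restore steps in C. The structure is a simplified illustration (the field names, register counts and the context_switch helper are invented for this example; real kernels save considerably more state, such as segment and control registers):

    #include <stdint.h>
    #include <string.h>

    /* Simplified, hypothetical PCB; real kernels save more state. */
    struct pcb {
        int      pid;        /* process ID                */
        uint64_t pc;         /* program counter           */
        uint64_t gpr[16];    /* general purpose registers */
        uint64_t fpr[16];    /* floating point registers  */
    };

    /* The CPU register file as visible to the kernel (illustrative). */
    struct cpu_regs {
        uint64_t pc;
        uint64_t gpr[16];
        uint64_t fpr[16];
    };

    /* Steps 1 and 3 of the timer interrupt path in Figure 2.2;
       the scheduler's choice of `next` is step 2. */
    void context_switch(struct cpu_regs *cpu, struct pcb *prev, struct pcb *next)
    {
        prev->pc = cpu->pc;                              /* 1. save A's state */
        memcpy(prev->gpr, cpu->gpr, sizeof prev->gpr);
        memcpy(prev->fpr, cpu->fpr, sizeof prev->fpr);

        cpu->pc = next->pc;                              /* 3. restore B's state */
        memcpy(cpu->gpr, next->gpr, sizeof cpu->gpr);
        memcpy(cpu->fpr, next->fpr, sizeof cpu->fpr);
    }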
2.3 Performance Penalty due to Context Switching
There are two distinct types of performance penalty due to context
switching. They are explained below.
2.3.1 Direct Penalty
This is the sum of time spent in copying the CPU register state of
Process A to memory, and in copying the saved CPU register state of Process
B to the actual CPU registers. This is deterministic and does not depend
on the process getting switched in or out. In today's CPUs, the direct penalty
is on the order of a few hundred nanoseconds, given that clock speeds are in
the GHz range.
2.3.2 Indirect Penalty
Every time a context switch occurs, the old cache contents become
obsolete for the new process. The new process that is resuming execution
incurs more cache misses than it would have if it had not been interrupted.
Since the CPU state is only a few registers, the direct penalty is relatively
small when compared to the indirect penalty, especially with increasing cache
sizes [2]. This is not deterministic and changes from process to process. So,
the indirect penalty could be hard to quantify. It depends on the following.
• The memory accesses made by a process.
• The phase a process is currently in with respect to memory accesses.
Chapter 3
Related Work
This chapter discusses related work, and is divided into two sections.
The first section deals with prior work on reducing the performance penalty
due to context switching and the second section deals with published hybrid
memory systems where the traditional memory hierarchy is augmented using
newer memory technologies.
3.1 Reducing the Performance Penalty due to Context Switching
In Section 2.3, the performance penalty due to Context Switching was
classified into direct and indirect penalty. The direct penalty is the cost of
moving the context of the registers in and out. The indirect penalty is the
cost of warming up the cache every time a context switch occurs. In the
following two subsections, the first discusses techniques to reduce the direct
penalty, and the second addresses ways of reducing the indirect penalty.
3.1.1 CLOCS, 1990
CLOCS stands for Computer for LOw Context Switch time [3]. The
main idea in this work is that the context switch time can be reduced by
reducing the size of the ‘context’. The proposed architecture reduces the size
of the ‘context’ to just one register, i.e., the program counter. It is a memory
to memory architecture, where each instruction uses memory addresses as the
operands, thereby not needing a set of general purpose registers at all.
Figure 3.1: Traditional Memory Hierarchy Vs. CLOCS Memory Hierarchy
As shown in Figure 3.1 from [3], the CLOCS architecture removes the
caches from the memory hierarchy while retaining support for virtual memory.
This effectively eliminates updates to the memory hierarchy necessary due to
a context switch (for example, updating TLB and MMU with new virtual-to-
physical address mappings, writing back dirty cache lines to memory, etc.). It
also eliminates the indirect penalty of warming up the caches by removing the
caches altogether.
3.1.2 ETS, 2014
ETS stands for Elastic Time Slicing [4]. Here, the system considered
is an aggressively multi-tasked virtualized environment. The specific problem
solved was the performance impact of cache misses due to displaced cache
state after Context Switching (called CS misses). In the proposed solution,
ETS, the OS dynamically finds a time slice duration to optimize for system
performance instead of having a fixed time slice for each process, by utilizing
the CS miss count exposed to it by the hardware.
Figure 3.2: Impact of CS misses on different workloads
Figure 3.2 is from [4], where (a) shows a workload that suffers very
low performance penalties due to context switching, (b) shows a workload
that suffers relatively more cache misses than (a), and (c) shows that if the time
slice for workloads like (b) is increased, it can sustain peak performance
for longer periods.
A balance between performance and latency is found and constantly re-
evaluated using dynamic hardware measurements of the impact of CS misses
using an Auxiliary Tag Directory in addition to the Main Tag Directory. An-
other interesting section in this paper makes an observation that cache opti-
mizations such as advanced cache replacement policies (e.g., Re-Reference In-
terval Prediction [5]) exacerbate this problem. Under the related work section
in this paper, a subsection titled Performance Impact of Context Switching
contains this note quoted below.
“There were many studies which aimed at understanding the perfor-
mance impact of CS events. They considered the influence of additional cache
misses and page faults on performance. The proposed solutions include job
speculative prefetching, CPU scheduling guided by memory scheduling, and
intelligent process scheduling. Most of the works concluded that the indirect
overhead, due to cache perturbation, associated with CS events is significant.
We attempted to address this overhead through our proposal.”
This corroborates the motivation for this thesis since it also tries to
solve this indirect overhead.
3.2 Changing the Traditional Memory Hierarchy
Chapter 4 explains the need to change the traditional memory hierarchy
to accommodate emerging memory technologies so that these newer
technologies can fill gaps in the traditional memory hierarchy. Here, a few
such hybrid memory systems are described to help in understanding the
memory hierarchy envisioned for this thesis.
3.2.1 CAMEO, 2014
CAMEO stands for CAche-like MEmory Organization [6]. Changing
the memory hierarchy to accommodate emerging memory technologies
poses a fundamental question to the computer architect: whether to expose
these memories to the software or to manage them micro-architecturally
(i.e., invisible to the software). The focus here is on stacked
DRAM memories and how to find a place for them in the traditional memory
hierarchy. The proposed solution uses the stacked DRAM memory available in
the system efficiently at a fine granularity like a cache, while also augmenting
the main memory visible and available to the software. In other words, the stacked
DRAM is usable as a cache in the traditional sense, with data accessed
at a fine cache line granularity, yet it is not hidden from the software and
forms part of the software-visible main memory. This is possible because the
design can dynamically move cache lines between the stacked DRAM and
main memory. The paper also proposes a Line Location Table and a Line
Location Predictor to track the location of all data lines.
3.2.2 ThyNVM, 2015
ThyNVM stands for Transparent Hybrid NVM [7]. This work also
addresses the problem of accommodating emerging NVM technologies in the
existing memory hierarchy. A few lines from the abstract of this paper are
quoted below.
“Emerging byte-addressable nonvolatile memories (NVMs) promise per-
sistent memory, which allows processors to directly access persistent data in
main memory.”
It also addresses the concern of increased programmer effort to take
advantage of the persistent nature of emerging NVM technologies. In today’s
systems, an NVM library provided by the hardware manufacturer is needed
to utilize persistent storage directly from a program [8]. This library typically
provides a set of APIs to directly malloc persistent memory, especially targeted
at server applications such as databases. The specific problem solved by this
paper is this need of the software (or programmer) to ensure memory consis-
tency in the event of a power failure or a system crash. The proposed solution
provides software transparent crash consistency using a hardware assisted hy-
brid DRAM+NVM persistent memory design.
3.2.3 N3XT, 2015
N3XT is a Nano-Engineered Computing Systems Technology [9] that
is targeted at providing energy efficiency for abundant data workloads. It
promises a 1,000x improvement in energy efficiency by integrating vastly
different technologies into a new form of computing system. The main ideas
proposed in this work are the following.
1. Use energy-efficient 1D carbon nanotube transistors (CNT FETs).
2. Use a hybrid memory system with 3D RRAM and STT-MRAM to derive
the benefits of both massive storage and quick access.
3. Exploit fine-grained monolithic 3D integration of compute and memory
components using ultradense nanoscale vias.
4. Take computation nearer to the memory and interleave memory and
computation for optimum latency while employing efficient nanoscale
heat removal methods.
In such a system, the software and hardware interact dynamically to
optimize for the design point; the design point could be either power or per-
formance. So, improvements in the context switch performance will improve
the energy efficiency of the system at both these design points.
Chapter 4
Emerging Memory Technologies
From an ideal memory, we expect infinite capacity and zero latency.
No single memory technology can provide both requirements. This is because,
very often, these requirements oppose each other. For example, a particular
memory technology might be able to provide high capacity but might not be
able to provide low latency (e.g., disks) or a particular memory technology
might be able to provide low latency but might not be able to provide high
capacity (e.g., on-chip caches). Hence, in most systems, we see a memory
hierarchy where, closer to the CPU we have a smaller but faster memory; as
we go further away from the CPU, the memories tend to become larger, but
also slower compared to the ones closer to the CPU. DRAM technology scaling
has slowed considerably and we see reliability and security problems due to
the extent of this scaling. One example of this problem is demonstrated in [10],
which shows that by repeatedly accessing a row in a DRAM chip, one can flip
bits in adjacent rows.
For future systems, there is a need to accommodate emerging memory
technologies such as Phase Change Memory, Spin Transfer Torque RAM, Re-
sistive RAM, and eDRAM within the existing memory hierarchy. These new
memory technologies are byte-addressable and are also non-volatile, with the
exception of eDRAM. Therefore, they can either replace portions of the ex-
isting memory hierarchy or can augment it. Hybrid memory systems [11–13]
have been proposed to leverage the best of multiple technologies. In tradi-
tional charge storage based memories (e.g., DRAM, Flash) data is written
by capturing charge Q, and read by detecting the voltage V. With scaling,
the charge stored decreases from generation to generation, and the detection
circuitry limits further scaling. Conversely, in resistive memories (e.g., PCM,
STT-RAM, ReRAM), data is written by pulsing current dQ/dt and read by
detecting resistance R. In the following sections, data storage mechanisms of
these emerging memory technologies are described and the pictures [14–16]
are shown for visualization.
4.1 Spin Transfer Torque RAM
Figure 4.1: STT-RAM Storage Cell
STT-RAM uses the resistance difference between two configurations of
a Magnetic Tunnel Junction (MTJ) to store a “1” or a “0”. A Magnetic Tun-
nel Junction consists of a thin non-magnetic material (1 nm to 2 nm thick)
wedged between two ferro-magnetic materials (2 nm to 5 nm thick). One of
the layers is called the reference or fixed layer which has a fixed magnetic po-
larization and the other is called the free layer and has a magnetic polarization
that can be changed by applying a current to it. The resistance of the MTJ is
up to 600% higher when the two layers are polarized in opposite directions
as compared to when they are polarized in the same direction [17]. This dif-
ference in resistance is further increased if the non-magnetic material between
the two ferro-magnetic materials is an electrical insulator. The conductivity
modulation is attributed to Tunnelling Magneto-Resistance (TMR), a quan-
tum mechanical phenomenon. This effect is due to spin-polarized electrons
tunnelling between the ferro-magnetic layers through the insulator. Read and
write latencies are around 10 ns and the endurance is more than 10^14 cycles.
4.2 Resistive RAM
Figure 4.2: ReRAM Storage Cell
In ReRAM, an insulating dielectric is used to form a conducting path
between two metal electrodes by applying a sufficiently high voltage between
them. Oxygen vacancies in such a conducting path lead to a low resistance
which represents a “1” and when this path is broken, the resistance increases
and hence the state represented is a “0” [18]. Read latency is 6 to 8 ns and
write latency is around 20 to 30 ns. ReRAM has a low write endurance of 10^11
cycles, which is very often a limiting factor.
4.3 Phase Change Memory
Figure 4.3: PCRAM Storage Cell
Phase Change Memory material like GeSbTe (GST) can exist either
in amorphous or in crystalline form depending on the current pulse duration
applied to it. In the amorphous form it has a high resistance (10^6 to 10^7 Ω)
and in the crystalline form it has a low resistance (10^3 to 10^4 Ω) [11]. A write
operation consists of injecting current to change the phase of the material. A
SET pulse is a sustained current pulse (∼150 ns) to heat the cell above T_cryst
(∼300 °C) and a RESET pulse is a comparatively shorter current pulse (∼100
ns) to heat it above T_melt (∼800 °C). A read operation consists of detecting
the phase using the difference in resistance between the two phases. Because
of this, an important difference between PCM and other memory technologies
is that PCM has two different write latencies for SET and RESET operations,
while the other memory technologies have a uniform latency regardless of
whether the write data is a 1 or a 0. PCM arrays are non-volatile and can
retain data for more than 10 years at 85 °C but have a very low endurance,
which is capped at 10^9 cycles. Multi-level cells with up to 4 bits per cell have
been demonstrated as prototypes. So, the density of such memories can be
very high.
4.4 eDRAM
Similar to a conventional DRAM, embedded DRAM consists of a ca-
pacitive storage element, but it can be integrated on the same die as the
processor. Read and write latencies are around 2 ns to 3 ns, which is more
than an order of magnitude lower than conventional DRAMs [18]. Since the
stored charge leaks, a refresh is necessary periodically. The retention time is
around 4 ms for eDRAM whereas it is typically 64 ms for commodity DRAM
parts. So, eDRAM needs more frequent refreshes than DRAM. This causes
issues in scaling eDRAM to future technology nodes due to difficulty in precise
charge placement and data sensing.
                          STT-RAM   ReRAM      PCM          eDRAM
  Cell Element            1T1R      1T1R       1T1R         1T1C
  Cell Area (F^2)         6         4          4            38
  Read Latency            10 ns     6-8 ns     12 ns        2 ns
  Write Latency           10 ns     20-30 ns   100-150 ns   2 ns
  Retention               >10 y     >10 y      >10 y        4 ms
  Endurance (r/w cycles)  10^14     10^11      10^9         10^16
Figure 4.4: Comparison of Emerging Memory Technologies
4.5 Design Space Exploration using DESTINY
DESTINY is a 3D dEsign Space exploraTIon tool for SRAM, eDRAM
and Non-volatile memorY [19]. It can model 2D and 3D SRAM, eDRAM and
ReRAM memory structures. It can also model 2D structures of STT-RAM and
PCM memories. The user can configure it to optimize for various optimization
targets such as latency, area, leakage, refresh rate, and energy delay product.
An overview of the tool’s framework is shown in Figure 4.5 taken from
[20]. DESTINY can also be configured to model various types of regular mem-
ory array structures such as a RAM, or a CAM or a Cache. Further, it can
model process technologies ranging from 22nm to 180nm. Another parameter
that is configurable in the tool is the data width.
Figure 4.5: An Overview of the DESTINY Framework
In order to identify a particular memory technology for the Cache Con-
text Store, the tool is configured to optimize for latency. In this thesis, the type
of memory structure modeled is specified as a RAM and the process technol-
ogy is specified as 22nm. Using a Python script, different configuration files are
generated for a range of memory sizes from 1MB to 128MB, and also for each
memory technology type except SRAM. Further, the data width was swept
across 64b, 128b and 256b configurations. In these simulations, changing the
data width did not change the read and write latencies considerably but had
a direct relation to the read and write bandwidths. Therefore, the reported
results are for the 256b bus width configuration. Since PCRAM has different
write latencies depending on whether the written bit is a 0 (RESET) or a 1
(SET), write latency and write dynamic energy plots are given separately for
PCRAM.
From Figures 4.6 to 4.14, it can be seen that 3D eDRAM is the best
candidate when all parameters are considered for the Cache Context Store de-
sign. It has the best balance between read and write latencies for the memory
sizes considered; that means the cache context can be stored and retrieved in
about the same time. From Figures 4.6 and 4.8, the read and write latency
for a 64MB eDRAM is around 2 ns to 4 ns. It also has a read and write
dynamic energy (∼1 nJ) that is comparatively lower than the other memory
technologies. The leakage power is higher in eDRAM (∼ 3W) as compared
to other technologies due to the need to refresh the stored data periodically.
However, since the Cache Context Store is not going to be always needed as
in the case of main memory, it can be sent to a low power mode when not in
use. Similarly, the endurance of eDRAM is 10^16 cycles, which is many orders
of magnitude better than other memory technologies as shown in Figure 4.4.
[Bar chart: write latency (ns) for 3D eDRAM, 2D eDRAM, STT-RAM, 2D ReRAM and 3D ReRAM at capacities from 1 MB to 128 MB]
Figure 4.6: Write Latency
[Bar chart: PCRAM RESET and SET write latency (ns) at capacities from 1 MB to 128 MB]
Figure 4.7: Write Latency for PCRAM
[Bar chart: read latency (ns) for 3D eDRAM, 2D eDRAM, STT-RAM, 2D ReRAM, 3D ReRAM and PCRAM at capacities from 1 MB to 128 MB]
Figure 4.8: Read Latency
[Bar chart: read bandwidth (GB/s) for 3D eDRAM, 2D eDRAM, STT-RAM, 2D ReRAM, 3D ReRAM and PCRAM at capacities from 1 MB to 128 MB]
Figure 4.9: Read Bandwidth
[Bar chart: write bandwidth (GB/s) for 3D eDRAM, 2D eDRAM, STT-RAM, 2D ReRAM, 3D ReRAM and PCRAM at capacities from 1 MB to 128 MB]
Figure 4.10: Write Bandwidth
[Bar chart: read dynamic energy (pJ) for 3D eDRAM, 2D eDRAM, STT-RAM, 2D ReRAM, 3D ReRAM and PCRAM at capacities from 1 MB to 128 MB]
Figure 4.11: Read Dynamic Energy
[Bar chart: write dynamic energy (pJ) for 3D eDRAM, 2D eDRAM, STT-RAM, 2D ReRAM and 3D ReRAM at capacities from 1 MB to 128 MB]
Figure 4.12: Write Dynamic Energy
[Bar chart: PCRAM RESET and SET write dynamic energy (nJ) at capacities from 1 MB to 128 MB]
Figure 4.13: Write Dynamic Energy for PCRAM
[Bar chart: leakage power (mW) for 3D eDRAM, 2D eDRAM, STT-RAM, 2D ReRAM, 3D ReRAM and PCRAM at capacities from 1 MB to 128 MB]
Figure 4.14: Leakage Power
Chapter 5
Architecture Description & Evaluation
The main idea of this thesis is to solve the cache cold start problem
due to context switching using a cache context store made out of an emerging
memory technology. This chapter discusses the micro-architectural changes
needed for such a system and shows the speedup that can be achieved for
SPEC CPU2006 benchmarks. In order to simplify the problem and come up
with a primitive solution, the system considered here has the following features.
• A single core CPU without simultaneous multithreading.
• A single level cache that is divided between instruction and data.
• A write-through policy so that there are no dirty lines in the cache.
Figure 5.4 shows the control flow for saving the cache context into the cache
context store and retrieving it. Once a context switch is identified by the
hardware, the PID of the old process is saved to an address look-up table
along with the address of the memory location in which the cache context is
going to be stored. Then, the cache context is saved to the cache context store
at that address. If the PID of the new process is present in the address look-up
table then the stored cache context is copied back from the cache context store
using the corresponding address field. The address look-up table is shown in
Figure 5.3. Some of the problems associated with more advanced systems and
the design decisions that need to be made for such real systems are discussed
in Section 5.2.
5.1 Cache Context Store
For a byte-addressable memory, a generic m-way set associative cache
with p bits of physical address, divided into t bits of tag, i bits of index and
o bits of offset, has the following parameters.
• A cache line size of 2^o bytes.
• 2^i sets.
• m cache lines in a set.
• m ∗ 2^i cache lines in total.
[Diagram: a p-bit physical address (bits p−1 down to 0) divided into Tag (t bits), Index (i bits) and Offset (o bits) fields]
Figure 5.1: Fields in a Physical Address
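As a concrete illustration of this split, the helpers below extract the three fields from a physical address. The geometry constants are assumptions chosen only for the example (o = 6 for 64-byte lines, i = 7 for 128 sets), not the cache evaluated later:

    #include <stdio.h>

    #define O_BITS 6   /* offset bits: 2^6 = 64 B cache lines (assumed) */
    #define I_BITS 7   /* index bits:  2^7 = 128 sets (assumed)         */

    static unsigned long addr_offset(unsigned long paddr) {
        return paddr & ((1UL << O_BITS) - 1);
    }
    static unsigned long addr_index(unsigned long paddr) {
        return (paddr >> O_BITS) & ((1UL << I_BITS) - 1);
    }
    static unsigned long addr_tag(unsigned long paddr) {
        return paddr >> (O_BITS + I_BITS);   /* the remaining t bits */
    }

    int main(void) {
        unsigned long paddr = 0x12345678UL;
        printf("tag=0x%lx index=0x%lx offset=0x%lx\n",
               addr_tag(paddr), addr_index(paddr), addr_offset(paddr));
        return 0;
    }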
A cache is typically organized into tag and data stores as shown below in
Figure 5.2. If the tag store has u additional bits per line for bookkeeping
(replacement policy, coherence state, etc.), then
• the data store size is m ∗ 2^i ∗ 2^o bytes,
• and the tag store size is m ∗ 2^i ∗ (t + u) bits.
The total cache context is then m ∗ 2^i ∗ (2^o + ((t + u)/8))
bytes. This also includes the cache lines that are invalid, which need not
be saved. Assuming there are n cache lines that are invalid during the context
switch period, the cache context that needs to be saved is given by
((m ∗ 2^i) − n) ∗ (2^o + ((t + u)/8)) bytes.
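A short sketch evaluating these formulas for one illustrative configuration (all parameter values below are assumptions picked for the example):

    #include <stdio.h>

    int main(void) {
        /* Assumed geometry: 2-way, 512 sets, 64 B lines, 33 tag bits,
           3 bookkeeping bits per line. */
        const unsigned m = 2, i = 9, o = 6, t = 33, u = 3;

        unsigned long lines    = (unsigned long)m << i;  /* m * 2^i       */
        unsigned long data_b   = lines << o;             /* m * 2^i * 2^o */
        unsigned long tag_bits = lines * (t + u);
        double context_b       = lines * ((1UL << o) + (t + u) / 8.0);

        /* Prints: lines=1024 data=65536 B tag=36864 bits context=70144 B */
        printf("lines=%lu data=%lu B tag=%lu bits context=%.0f B\n",
               lines, data_b, tag_bits, context_b);
        return 0;
    }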
[Diagram: a tag store and a data store, each drawn as a grid of 2^i sets (rows) by m ways (columns)]
Figure 5.2: A Generic m-way Set Associative Cache
Once a context switch is detected from an ISA-specific register update
(e.g., the CR control registers in x86), if the cache tag and data stores can be
accessed as a byte-addressable memory, then a cache context save can be done
by copying the contents to a predetermined location in the Cache Context
Store. We can compare this data movement overhead to the performance
penalty of the cache misses caused by context switches to understand if there
is a potential benefit in storing the cache context.
The Cache Context Store mentioned above is envisioned as a byte-addressable
memory that can hold several tens of cache contexts with reasonable
read and write latency. In order to save and retrieve the cache
contexts from this memory, a look-up table is necessary to store the
process ID and the address corresponding to the saved cache context of every
process in the Cache Context Store. The hardware implementation of this
look-up table is similar to that of a Translation Lookaside Buffer (TLB), where
the data in the PID column needs to be compared with the current
PID using a content addressable memory to provide a hit or miss output.
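In hardware the PID comparison is a parallel CAM probe; the sequential scan below is only a software model of the same hit/miss behaviour (the structure, names and capacity of 10 saved contexts are hypothetical):

    #include <stddef.h>

    #define CCS_SLOTS 10   /* assumed: one slot per saved cache context */

    struct ccs_lut_entry {
        int           valid;
        int           pid;       /* process whose context is saved      */
        unsigned long ccs_addr;  /* where that context lives in the CCS */
    };

    static struct ccs_lut_entry lut[CCS_SLOTS];

    /* Returns 1 on a hit and writes out the CCS address of the saved
       context; returns 0 on a miss (nothing to restore). */
    int ccs_lookup(int pid, unsigned long *ccs_addr)
    {
        for (size_t k = 0; k < CCS_SLOTS; k++) {
            if (lut[k].valid && lut[k].pid == pid) {
                *ccs_addr = lut[k].ccs_addr;
                return 1;
            }
        }
        return 0;
    }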
[Table: CCS address look-up table with columns PID and Address and rows 1, 2, 3, …, x]
Figure 5.3: CCS Address Look-up Table
[Flowchart: a cache context save trigger arrives from the control register → save the old PID to the CCS address look-up table → save the cache context to the CCS and update the LUT entry → is the new PID in the CCS address look-up table? If yes, copy the cache context from the CCS to the cache and continue execution; if no, continue execution directly]
Figure 5.4: Cache Context Store Update Mechanism
5.2 Design Decisions for Real Systems
For the experiment and results in the following sections, we consider
a very simple system in order to keep the complexity tractable. For a single
core CPU with only L1 instruction and data caches, and a write through
policy, there are not many tradeoffs to make in the implementation. However,
in a real system with a multi-core CPU, each core might have one or two
levels of private cache, and all the cores typically share a larger last
level cache. There has been research done on how to improve performance by
partitioning shared caches based on utility [21]. Further, if these caches utilize
a writeback policy instead of a write through policy, a number of questions
need to be answered to make the Cache Context Store architecture discussion
more complete for such real systems.
1. Is it enough to back up the private caches only? Or is it also necessary
to back up the cache lines in the shared last level cache?
2. Is it necessary to evict the dirty cache lines from the cache and store only
clean cache lines in the Cache Context Store? Or, can the saved cache
context contain dirty lines too?
3. In a shared memory multi processor, there could be communication be-
tween processes by referring to the same physical page from two different
virtual pages. In such a system, there could be accesses to stored cache
lines of a particular process from another process. The memory model
should define if shared pages are cacheable and if they are, the cache
controller and the cache context store need to make sure that the shared
cache lines are up to date.
These are some examples of the complexity involved in implementing this
scheme in a real system.
5.3 Experiment Setup & Results
In order to quantitatively evaluate the performance impact of context
switches, an experiment to determine the increase in cache misses due to con-
text switching on the SPEC CPU2006 benchmark suite was performed. The experiment
consisted of running a subset of the SPEC CPU2006 benchmarks [22] on an
AMD Phenom(tm) II X6 1055T processor and collecting the number of cache
misses, context switches and runtime from the on-chip performance monitoring
counters using the perf tool available in Linux. The machine has a 6-core
processor, each core having 64KB L1 instruction and data caches and a
512KB L2 cache. All cores share a 6144KB L3 cache and the processor does
not support simultaneous multi-threading.
Since the processor in this system has 6 cores, care was taken to en-
sure that the benchmark and the co-executing processes run on the same core
throughout the experiment, thus making sure that an increase in the num-
ber of instances of the co-executing processes increases the number of context
switches proportionally, since the processes are time-sharing a single core. A
Linux tool called taskset was used to ensure this. This processor has a 6MB
shared last level cache and since only one process was allowed to run at any
time, that process had the cache for itself at all times. Care was also taken
to run only the benchmark and the co-executing processes on the system; this
included logging out of the desktop UI and using only a lightweight ssh
connection so that the desktop UI would not interfere with the measure-
ments.
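taskset sets the CPU affinity from the shell; the same pinning can also be done programmatically through the Linux sched_setaffinity interface, as in this sketch (the core number is arbitrary):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);   /* pin the calling process to core 0 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the benchmark or co-executing work here ... */
        return 0;
    }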
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int *array;
    int i, j;
    /* 2,000,000 ints = 8 MB, larger than the 6 MB shared L3, so
       walking the array evicts the benchmark's cache lines. */
    array = (int *) malloc(2000000 * sizeof(int));
    for (i = 0; i < 2000000; i++)
        array[i] = i;
    while (1) {
        /* Touch every 64 B cache line (16 ints per line). */
        for (i = 0; i < 2000000; i = i + 16)
            for (j = 1; j < 16; j++)
                array[i] = array[i] + array[i + j];
    }
    return 0;
}
Listing 5.1: Code to Flush the Cache
Each benchmark was run along with 0 to 9 instances of a co-executing
process that flushes useful data in the cache left by the benchmark after each
time slice. The co-executing processes ensure that the setup simulates a sce-
nario where an application is running in an aggressive multitasking environ-
ment which is when the cache misses due to context switches impact appli-
cation performance noticeably. The code in Listing 5.1 was used to flush the
cache to simulate the above scenario. Running an increasing number of in-
stances of this process ensured that the number of cache misses due to context
switching increased proportionally, as expected.
In the following plots, we can see the results of this experiment. The
cache misses are expressed in millions, the number of context switches is ex-
pressed in thousands and the runtime is expressed in seconds. We can see that
for some benchmarks – bzip2, astar, gcc, hmmer – the cache misses, context
switches and run time increase as the number of co-executing processes
is increased. For a few benchmarks – lbm, mcf, sphinx3, xalancbmk – the
cache misses and context switches do not follow the expected trend, although
the performance impact is still very clear from the runtime plot. xalancbmk
had a runtime of 350 seconds while having the CPU to itself. This increased
to 420 seconds when there were 9 instances of the co-executing process running
alongside the benchmark, almost a 20% increase in runtime directly
related to the effect of cache misses due to context switching.
As the number of co-executing processes increased, the perceived load
in the system increased, and hence the time slice allocated for each process
was shortened by the Linux scheduler. For most of the benchmarks, the aver-
age time slice allocated was 11.82ms when the benchmark ran on its own. This reduced
to 8ms, 6ms, 4ms, 3ms, 2ms or 1ms depending on the load in the system.
Such behaviour is very representative of most heavily loaded aggressive multi-
tasking systems, where the time slices allocated to a process change dynam-
ically depending on the load. As the time slice reduces, the number of time
slices needed by a benchmark increases. This also increases the performance
penalty due to a cold cache at the beginning of each time slice.
[Plots: cache misses (millions), context switches (thousands) and runtime (s) vs. number of co-executing processes for (a) bzip2, (b) perlbench, (c) astar]
Figure 5.5: Variation of Cache Misses, Context Switches and Runtime
[Plots: cache misses (millions), context switches (thousands) and runtime (s) vs. number of co-executing processes for (d) gcc, (e) calculix, (f) dealII]
Figure 5.5: Variation of Cache Misses, Context Switches and Runtime (cont.)
[Plots: cache misses (millions), context switches (thousands) and runtime (s) vs. number of co-executing processes for (g) hmmer, (h) lbm, (i) libquantum]
Figure 5.5: Variation of Cache Misses, Context Switches and Runtime (cont.)
[Plots: cache misses (millions), context switches (thousands) and runtime (s) vs. number of co-executing processes for (j) mcf, (k) povray, (l) sphinx3]
Figure 5.5: Variation of Cache Misses, Context Switches and Runtime (cont.)
[Plots: cache misses (millions), context switches (thousands) and runtime (s) vs. number of co-executing processes for (m) xalancbmk]
Figure 5.5: Variation of Cache Misses, Context Switches and Runtime (cont.)
5.4 Speedup Relative to the Worst Case Performance
Using the write bandwidth for 3D eDRAM and the number of context
switches for each of the benchmarks, the time spent in saving the cache con-
text can be analytically calculated. The write bandwidth for the 3D eDRAM
depends on the data width and the memory capacity. A 256 bit data width
and 64MB memory capacity are used so that cache contexts for at least 10 dif-
ferent processes can be saved and retrieved from the Cache Context Store. To
make a fair comparison, the cache context size is assumed to be 6 MB which
was the size of the cache for the experiment in the previous section.
Optimizations such as not saving invalid and outdated cache lines are
not taken into account, so this is a very pessimistic analysis. Let the time taken
to save the Cache Context once be t_save and the total time spent doing this
over the entire runtime of a benchmark be t_overhead. Then:
t_save = Cache Size / Write Bandwidth
t_overhead = t_save ∗ No. of Context Switches
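As a worked example of these two formulas, the sketch below plugs in illustrative numbers: the 6 MB context size from the experiment, a write bandwidth of 400 GB/s in the range Figure 4.10 reports for 3D eDRAM, and an assumed context switch count.

    #include <stdio.h>

    int main(void)
    {
        const double context_bytes = 6.0 * 1024 * 1024;  /* 6 MB cache context */
        const double write_bw      = 400e9;              /* 400 GB/s, assumed  */
        const long   n_switches    = 100000;             /* assumed, from perf */

        double t_save     = context_bytes / write_bw;    /* one save           */
        double t_overhead = t_save * n_switches;         /* whole benchmark    */

        printf("t_save = %.1f us, t_overhead = %.2f s\n",
               t_save * 1e6, t_overhead);                /* ~15.7 us, ~1.57 s  */
        return 0;
    }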
The total saving overhead for an entire program is determined by three factors.
• Size of the Context to be saved on every context switch.
• Bandwidth of the Cache Context Store.
• Number of time slices necessary to run to completion.
Of the above, the number of time slices a program needs to run to
completion corresponds to the number of context switches. For Linux, this de-
pends on the load of the system since the time slice can vary from 1ms to 10ms
or more. So, if we use the average time slice from when the benchmark was
running on its own, i.e., closer to the maximum available time slice value, that
would result in a very optimistic speedup estimation. If we use the minimum
time slice a benchmark can get from the OS, then it would result in a very
pessimistic speedup estimation. A more realistic speedup estimation can be
obtained by using the median time slice value from running 0 to 9 co-executing
processes.
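Putting these pieces together, one plausible way to write the estimate (an assumed formulation for illustration; the exact bookkeeping is not spelled out here, so treat the symbols as hypothetical) is:

speedup ≈ t_worst / (t_standalone + t_overhead)

where t_worst is the measured runtime with co-executing processes, t_standalone is the runtime with the CPU to itself, and t_overhead follows from the context switch count implied by the chosen time slice, as defined above.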
Figure 5.6 shows these speedups due to saving and reusing the cache
context when compared to the worst case runtime without saving the cache
context. The achieved speedup ranges from -1.3% to 19.1%, with an average
of 7.4%, for the optimistic estimation. It ranges from -5.4% to 12.0%, with an
average of 3.3%, for the realistic estimation. It ranges from -5.9% to 11.1%,
with an average of 0.9%, for the pessimistic estimation. A consistent improvement is
observed for many benchmarks across all estimations. All benchmarks with
negative speedup can be stopped from using the Cache Context Store using
dynamic performance measurements similar to Elastic Time Slicing in [4].
[Bar charts: per-benchmark speedup (0.95 to 1.15), including GMEAN, AMEAN and HMEAN, for (a) Optimistic, (b) Realistic and (c) Pessimistic time-slice estimations]
Figure 5.6: Speedup for a 64 MB 3D eDRAM Cache Context Store
Chapter 6
Limitations & Future Improvements
This chapter addresses limitations of the proposed design, the feasibility
of deploying this solution in a real system, and possible future improvements
based on learning from this experiment.
6.1 Limitations of the Proposed Design
All programs have unique memory access patterns. Even a single pro-
gram can go through different phases of execution; in a particular phase, the
memory access pattern might look different from that of a previous phase. There-
fore, solutions such as Elastic Time Slicing [4] evaluate the impact of context
switches on a running program dynamically and hence are better suited for
implementation in today’s systems. On the other hand, such solutions require
the Operating System to be modified to make use of this information exposed
to the software from the hardware.
The proposed design has the ability to be a pure microarchitectural
solution without the need for software to interfere in normal use cases. Having
said that, the main drawback of the proposed design is that it spends time
in saving the cache context to the Cache Context Store every time a context
switch occurs. This means that the design is unable to differentiate between
various programs and program phases and adapt its behaviour accordingly.
6.2 Feasibility for Deployment of this Solution in a Real System
Section 5.2 briefly referred to some potential challenges in extending
this solution to a practical design. There have been a number of simplifying
assumptions made initially to reduce the problem to a tractable level. But the
idea that cache context is valuable runtime information known to the hardware
can be useful in the future, if it can be saved and retrieved swiftly. This is very
much extendable to any multitasking system with a cache. Therefore, even
though the proposed solution is not feasible today, a more practical solution
is very much possible in the near future.
6.3 Possible Future Improvements
As mentioned in Section 6.1, this design spends time in saving the cache
context by moving the tag and data store contents to the byte-addressable
Cache Context Store. Although this overhead is theoretically lower than the
penalty due to cache misses caused by context switching, this is still an over-
head every time a context switch occurs. To eliminate this time spent in
saving the cache context to the Cache Context Store completely, the obvious
next step is to snoop the memory transactions and build an independent copy
of the cache context in the Cache Context Store. This would mean that there
is no time spent in copying the tag and data store contents every time a con-
text switch occurs, instead the cache context is incrementally built within the
Cache Context Store at runtime. The tradeoff then would be between power
and performance requirements of the system.
Bibliography
[1] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, Operating Systems:
Three Easy Pieces, 0th ed. Arpaci-Dusseau Books, May 2015.
[2] D. Daly and H. W. Cain, “Cache Restoration for Highly Partitioned Vir-
tualized Systems,” in High Performance Computer Architecture (HPCA),
2012 IEEE 18th International Symposium on. IEEE, 2012, pp. 1–10.
[3] M. C. Davis, “A Computer for Low Context-Switch Time,” Tech. Rep.
90-012, 1990. [Online]. Available: http://www.cs.unc.edu/techreports/
90-012.pdf
[4] N. Jammula, M. Qureshi, A. Gavrilovska, and J. Kim, “Balancing Context
Switch Penalty and Response Time with Elastic Time Slicing,” in High
Performance Computing (HiPC), 2014 21st International Conference on.
IEEE, 2014, pp. 1–10.
[5] A. Jaleel, K. B. Theobald, S. C. Steely Jr, and J. Emer, “High Perfor-
mance Cache Replacement Using Re-Reference Interval Prediction (RRIP),”
in ACM SIGARCH Computer Architecture News, vol. 38, no. 3. ACM,
2010, pp. 60–71.
[6] C. Chou, A. Jaleel, and M. K. Qureshi, “CAMEO: A Two-Level Memory
Organization with Capacity of Main Memory and Flexibility of Hardware-
Managed Cache,” in Proceedings of the 47th Annual IEEE/ACM Interna-
tional Symposium on Microarchitecture. IEEE Computer Society, 2014,
pp. 1–12.
[7] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, “ThyNVM:
Enabling Software-Transparent Crash Consistency in Persistent Memory
Systems,” in Proceedings of the 48th International Symposium on Mi-
croarchitecture. ACM, 2015, pp. 672–685.
[8] Persistent Memory Programming, NVM Library. [Online]. Available:
http://pmem.io/
[9] M. M. S. Aly, M. Gao, G. Hills, C.-S. Lee, G. Pitner, M. M. Shu-
laker, T. F. Wu, M. Asheghi, J. Bokor, F. Franchetti et al., “Energy-
Efficient Abundant-Data Computing: The N3XT 1,000x,” Computer,
vol. 48, no. 12, pp. 24–33, 2015.
[10] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai,
and O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An
Experimental Study of DRAM Disturbance Errors,” in ACM SIGARCH
Computer Architecture News, vol. 42, no. 3. IEEE Press, 2014, pp. 361–
372.
[11] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change
Memory as a Scalable DRAM Alternative,” in ACM SIGARCH Computer
Architecture News, vol. 37, no. 3. ACM, 2009, pp. 2–13.
[12] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Eval-
uating STT-RAM as an Energy-Efficient Main Memory Alternative,” in
Performance Analysis of Systems and Software (ISPASS), 2013 IEEE
International Symposium on. IEEE, 2013, pp. 256–267.
[13] J. Meza, Y. Luo, S. Khan, J. Zhao, Y. Xie, and O. Mutlu, “A Case
for Efficient Hardware/Software Cooperative Management of Storage and
Memory,” 2013.
[14] Schematic Diagram of the High and Low Resistance States in a Spin
Valve. [Online]. Available: https://commons.wikimedia.org/wiki/File:Spin_valve_schematic.svg
[15] Cross Section of two PRAM Memory Cells. [Online]. Available:
https://commons.wikimedia.org/wiki/File:PRAM_cell_structure.svg
[16] Structure of an RRAM Memory Cell. [Online]. Available:
http://spectrum.ieee.org/image/MTY3ODUwMw
[17] M. K. Qureshi, S. Gurumurthi, and B. Rajendran, “Phase Change Mem-
ory: From Devices to Systems,” Synthesis Lectures on Computer Archi-
tecture, vol. 6, no. 4, pp. 1–134, 2011.
[18] S. Mittal, J. S. Vetter, and D. Li, “A Survey of Architectural Approaches
for Managing Embedded DRAM and Non-volatile On-chip Caches,” IEEE
Transactions on Parallel and Distributed Systems, vol. 26, no. 6, pp.
1524–1537, 2015.
[19] M. Poremba, S. Mittal, D. Li, J. S. Vetter, and Y. Xie, “DESTINY: A Tool
for Modeling Emerging 3D NVM and eDRAM caches,” in Proceedings of
the 2015 Design, Automation & Test in Europe Conference & Exhibition.
EDA Consortium, 2015, pp. 1543–1546.
[20] S. Mittal, M. Poremba, J. Vetter, and Y. Xie, “Exploring Design Space
of 3D NVM and eDRAM Caches Using DESTINY Tool,” Oak Ridge Na-
tional Laboratory, USA, Tech. Rep. ORNL/TM-2014/636, 2014.
[21] M. K. Qureshi and Y. N. Patt, “Utility-Based Cache Partitioning: A Low-
Overhead, High-Performance, Runtime Mechanism to Partition Shared
Caches,” in Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM
International Symposium on. IEEE, 2006, pp. 423–432.
[22] SPEC CPU2006. [Online]. Available: https://www.spec.org/cpu2006/
[23] X. Dong, C. Xu, N. Jouppi, and Y. Xie, “NVSim: A Circuit-Level Per-
formance, Energy, and Area Model for Emerging Nonvolatile Memory,”
in Emerging Memory Technologies. Springer, 2014, pp. 15–50.
[24] M. M. Shulaker, G. Hills, H.-S. P. Wong, and S. Mitra, “Transforming
Nanodevices to Next Generation Nanosystems,” in Embedded Computer
Systems: Architectures, Modeling and Simulation (SAMOS), 2016 Inter-
national Conference on. IEEE, 2016, pp. 288–292.
[25] M. M. Shulaker, T. F. Wu, A. Pal, L. Zhao, Y. Nishi, K. Saraswat, H.-S. P.
Wong, and S. Mitra, “Monolithic 3D Integration of Logic and Memory:
Carbon Nanotube FETs, Resistive RAM, and Silicon FETs,” in Electron
Devices Meeting (IEDM), 2014 IEEE International. IEEE, 2014, pp.
27–4.
[26] S. Mittal, J. S. Vetter, and D. Li, “Improving Energy Efficiency of Em-
bedded DRAM Caches for High-end Computing Systems,” in Proceedings
of the 23rd international symposium on High-performance parallel and
distributed computing. ACM, 2014, pp. 99–110.
[27] Y. Patt and S. Patel, Introduction to Computing Systems. McGraw-Hill,
2003.
[28] A. Agarwal, J. Hennessy, and M. Horowitz, “Cache Performance of Op-
erating System and Multiprogramming Workloads,” ACM Transactions
on Computer Systems (TOCS), vol. 6, no. 4, pp. 393–431, 1988.