The Pennsylvania State University
The Graduate School
College of Engineering
KERNEL-BASED ENERGY OPTIMIZATION IN GPUS
A Thesis in
Computer Science and Engineering
by
Amin Jadidi
© 2015 Amin Jadidi
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science
December 2015
The thesis of Amin Jadidi was reviewed and approved∗ by the following:
Chita Das
Professor of Computer Science and Engineering
Thesis Advisor
Mahmut T. Kandemir
Professor of Computer Science and Engineering
Raj Acharya
Head of the Department of Computer Science
∗Signatures are on file in the Graduate School.
Abstract
Emerging GPU architectures offer a cost-effective computing platform by providing thousands of energy-efficient compute cores and high bandwidth memory that facilitate the execution of highly parallel applications. In this paper, we show that different applications, and in fact different kernels from the same application, might exhibit significantly varying utilizations of compute and memory resources. However, we observe that the same kernel displays similar behavior in its different invocations; moreover, most of the kernels are invoked many times during the course of execution. By exploiting these properties of kernels, in order to improve the energy efficiency of the GPU system, we propose a dynamic resource configuration strategy that classifies kernels as compute-intensive or memory-intensive based on their resource utilizations and dynamically employs memory voltage/frequency scaling or core shut-down techniques for compute- and memory-intensive kernels, respectively. This strategy uses performance and memory bandwidth utilization information from the first few invocations of a kernel to determine the optimal hardware configuration for future invocations. Experimental evaluations show that our strategy saves about 20% of total chip energy and 70% of total memory leakage power for memory- and compute-intensive kernels, respectively, which are within 8% of the optimal savings that can be obtained from an oracle scheme.
Table of Contents
List of Figures
List of Tables

Chapter 1  Introduction
Chapter 2  Background
Chapter 3  Motivation
  3.1  Investigating Resource Underutilization in GPUs
  3.2  Quantifying Resource Underutilization in GPUs at the Kernel Granularity
  3.3  Kernel-level Properties of GPGPU Applications
  3.4  Classifying GPGPU Kernels Dynamically at Runtime
Chapter 4  Kernel Based GPU Resource Management
  4.1  CSS: Core-Side Energy Optimization Scheme
  4.2  MSS: Memory-Side Energy Optimization Scheme
  4.3  Handling Outlier Kernels
    4.3.1  Kernels that are Executed Only Once (Outlier1-Scheme)
    4.3.2  Kernels with Different Behavior over Different Launches (Outlier2-Scheme)
  4.4  Putting it Together
  4.5  Microarchitecture and Hardware Overhead
Chapter 5  Experimental Results
  5.1  Methodology
  5.2  Dynamism of the CSS and MSS Techniques
  5.3  Evaluation of CSS
  5.4  Evaluation of MSS
  5.5  Running Multiple Applications Concurrently
Chapter 6  Related Work
Chapter 7  Conclusions
Bibliography
List of Figures
2.1  Target GPGPU architecture.
3.1  Effect of increasing the number of cores on the performance of three applications. For each application, the results are normalized with respect to the highest IPC observed with that application.
3.2  Effect of memory frequency scaling on the performance of three applications. For each application, the results are normalized with respect to the highest IPC observed in that application.
3.3  Energy consumption of select benchmarks with respect to the baseline. Each bar is normalized with respect to the energy consumption of the corresponding application in our base configuration (i.e., without any energy optimization).
3.4  Effect of increasing the number of cores on the performance of different kernels from MST.
3.5  The two most important kernels in PVC that exhibit different bandwidth utilizations.
3.6  IPC of kernels over different invocations. Different invocations of the same kernel provide very similar performance.
4.1  Performing binary search on the (a) number of SMs and (b) memory frequency/voltage.
4.2  Transitioning between the compute-intensive (CI) and memory-intensive (MI) states in PVC.
4.3  Microarchitecture design of the proposed techniques.
5.1  Stabilization of binary search (ideal number of SMs) for three different kernels from the HIST application.
5.2  Stabilization of binary search (ideal memory V/F) for two main kernels from the PVC application.
5.3  Dynamic power consumption breakdown across memory-intensive kernels (baseline, without any energy optimization).
5.4  Energy saving gained by using the optimal number of SMs and power gating the rest of the SMs.
5.5  Analyzing the saturated region for MST and Stencil.
5.6  Average number of cores.
5.7  Normalized IPC values for different applications with respect to the baseline.
5.8  Normalized memory leakage power for different applications with respect to the baseline.

List of Tables

5.1  Baseline configuration. We normalize the results with respect to the number of cores and memory voltage/frequency configuration denoted in bold.
5.2  List of GPU benchmarks. Type-M: applications with MBU > 40%. Type-O: outlier applications. Type-C: applications with MBU < 10%. In the last column, M and C refer to memory- and compute-intensive kernels.
Chapter 1
Introduction
GPUs are being increasingly employed to accelerate different types of computing platforms rang-
ing from embedded devices to supercomputers. As a result, today's GPUs no longer run only
graphics applications, but also applications from the database and high-performance computing
domains, among others. To cope with the contrasting demands of these different types of
applications, GPU architects keep increasing on-chip resources such as cores, caches, software-managed
memories and memory controllers (MCs), and projections [1] include even more powerful/resource-
rich GPU systems in the near future. An important question at this juncture is whether current
applications effectively utilize available on-chip resources in GPUs and, if not, why and what can
be done about it. Several recent papers [2–5] have focused on this resource utilization problem
and proposed techniques that handle underutilized hardware components in GPUs.
Our own research shows that both cores and memory bandwidth are highly underutilized in
many GPU applications. Moreover, different kernels of the same application can have significant
variations regarding utilizations of cores and memory bandwidth, making a universal solution
that works across different applications highly unlikely. Motivated by this observation, this paper
proposes an energy-saving strategy that exploits resource underutilization in GPUs. The unique
aspect of this strategy is that it operates at a kernel granularity, and uses both core shut-down
and memory frequency/voltage scaling to achieve significant energy savings.
Our main contributions in this work can be summarized as follows:
• We present empirical evidence clearly showing the underutilization of cores as well as mem-
ory bandwidth in a set of GPU applications. Our experimental results indicate not only that
different applications exhibit different utilizations of these two critical resources, but also
that even different kernels of the same application exhibit significant resource utilization
variations. Based on these results, we classify kernels in our applications as memory-intensive
and compute-intensive.
• We show that, despite the variations across kernels, different invocations of the same kernel
exhibit very similar behavior under a fixed resource allocation. This observation, combined
with the fact that an average kernel in GPU applications is invoked many times during the
course of execution, motivates a dynamic, history-based optimization strategy.
• We propose a kernel-level energy optimization strategy that capitalizes on these variations
across different kernels. Specifically, our strategy employs core shut-down for memory-
intensive kernels and memory frequency/voltage scaling for compute-intensive kernels. The
proposed binary search-based strategy uses the first few executions (invocations) of a kernel
to determine the ideal hardware configuration (in terms of the number of cores and memory
V/F) to be used in the remaining invocations.
• We present an experimental evaluation of our strategy using a set of GPU applications. The
results collected via a detailed simulation infrastructure indicate that our strategy saves
about 20% of total chip energy for memory-intensive and 70% of total memory leakage
power for compute-intensive kernels in our base architecture. We further show that these
savings are within 8% of optimal savings that could be obtained from a hypothetical scheme,
and are consistent across different values of our major simulation parameters.
• We also show in this paper how our approach handles outlier kernels that (i) are executed
only once or (ii) exhibit irregular behavior across successive invocations.
We believe this is the first paper that dynamically tunes the number of active SMs, at a
kernel-level granularity, based on memory utilization. Along these lines, Hong et al. [3] propose
an analytical model which predicts the ideal number of cores by using off-line information. In
another work, Lee et al. [4] analyze the impact of DVFS on cores/interconnect/caches and core
scaling techniques under a fixed power budget; however, they do not propose a dynamic scheme
for tuning these parameters at run-time.
The rest of this paper is structured as follows. Section 2 provides background on our target
GPU architecture. In Section 3, we describe the resource underutilization problem in GPUs. In
Section 4, we present our techniques for adaptive GPU reconfiguration. Section 5 presents an
experimental evaluation of the proposed strategy. Section 6 reviews related work and Section 7
concludes the paper.
Chapter 2
Background
In this section, we provide a brief background on the GPGPU architecture targeted by our work.
GPU Architecture: Our target GPU consists of multiple streaming multiprocessors (SMs),
each containing 32 CUDA cores [6]. Each CUDA core can execute a thread, in a “single-
instruction, multiple-threads” (SIMT) fashion. This architecture is supported by a large register
file that hides the latency of memory accesses [7,8]. The memory requests generated by multiple
concurrently executing threads in an SM are coalesced into fewer cache lines and sent to L1 data
cache, which is shared by all CUDA cores in the SM. Each SM also contains a read-only texture
cache, a constant cache, and a low-latency software-managed shared memory.

Figure 2.1. Target GPGPU architecture.

Misses in these caches
are injected into the network, which connects the SMs to 6 memory partitions through a cross-
bar [9]. Each memory partition includes a slice of shared L2 cache, and memory controllers
(MCs) [10, 11]. Figure 2.1 shows this baseline architecture. We assume that per-SM power-
gating [3,4], and multiple voltage-frequency states for memory controllers [5,12,13] are available
to our baseline.
GPGPU Applications: Execution on the GPU starts with the CPU allocating memory in GPU
memory. Then, the CPU copies the required data into this allocated memory, and a kernel is
launched on the GPU. During the course of execution of an application, the CPU generally
launches multiple kernels on the GPU. Each kernel consists of a set of parallel lightweight
threads [14]. Once all the threads finish their computation, the resulting data are copied to
the CPU memory and the GPU memory is freed. The number of threads in a kernel is a good
indicator of whether the kernel is scalable up to the number of SMs present in the system. Kernels containing only a few threads
would keep most of the core resources idle; thus, application scalability would be the limiting
factor of performance. Conversely, the high memory demands of many concurrently-executing threads
might saturate the memory bandwidth, causing insufficient memory bandwidth to be the
limiting factor of performance [3].
Chapter 3
Motivation
In this section, we characterize our applications at the kernel level to understand their resource
(core and memory) usage profiles. This characterization demonstrates the need for a dynamic
resource management scheme.
3.1 Investigating Resource Underutilization in GPUs
The main philosophy of GPGPU architectures is to provide a large number of computing cores
supported by a high bandwidth memory, in order to have a high throughput system. Such a
resource-rich design will be cost-effective only if the main resources such as computing cores
and memory are effectively utilized by hosted applications. Thus, it is vital to understand the
impact of different applications on the utilization of GPU resources. Typically, memory-intensive
applications utilize the memory bandwidth well, but might not need all the cores to be active in
order to achieve their best performance. On the other hand, compute-intensive applications utilize the
compute cores well, but leave the memory bandwidth underutilized.
To illustrate the effect of the number of cores on the system, let us examine Figure 3.1 (our
experimental setup will be given in Section 5.1). This figure shows the application performance
Figure 3.1. Effect of increasing the number of cores on the performance of three applications. For each application, the results are normalized with respect to the highest IPC observed with that application.
(normalized IPC), as we vary the number of available cores in three representative applications
(PATH, MUM and BFS) exhibiting different behaviors. Among these applications, PATH is
the only compute-intensive application, and its performance increases linearly as we increase
the number of cores, meaning that the application utilizes all the available GPU computational
resources.
The other applications are memory-intensive, and we observe that their performance does not
improve beyond a certain point. In fact, we even see some performance loss in BFS. In order to
understand the underlying reason for the saturation in performance with the increasing number
of cores, we need to look at the memory bandwidth demands of GPGPU applications.
Each thread of a GPGPU application has a certain memory bandwidth demand, and assuming
that an application has enough threads, as we increase the number of cores, we also increase the
number of concurrently-running threads, which in turn leads to an increase in the number of
memory transactions (read/write requests) from cores to memory per unit time. Beyond a
certain number of cores, this increase in memory traffic might cause the memory bandwidth to
saturate. Beyond this number, using additional cores will not improve performance, instead it
might lead to longer latency for memory accesses and longer stall times for the cores [15,16].
Besides, an application using an excessive number of concurrently-running threads might
suffer from severe resource contention [2,17], which further exacerbates the problem. If we could
somehow know the ideal core count for a memory-intensive application (i.e., the number of cores
Figure 3.2. Effect of memory frequency scaling on the performance of three applications. For each application, the results are normalized with respect to the highest IPC observed in that application.
beyond which we do not gain significant performance benefits), we could turn off the remaining
cores to save power without hurting performance significantly, and potentially improve it in some
applications such as BFS (see Figure 3.1).
To illustrate the effect of memory bandwidth on application performance, we vary the mem-
ory frequency, and in turn the corresponding memory supply voltage. Figure 3.2 shows the
application performance in three representative applications (DMR, MM and CONS) exhibit-
ing different behaviors. The performance of CONS, which is a memory-intensive application,
increases linearly with increasing memory V/F. On the other hand, DMR does not have many
memory transactions, and the thread-level parallelism available in the application provides suffi-
cient latency tolerance, and thus, it is not sensitive to the memory speed. MM exhibits a behavior
similar to CONS when memory V/F is increased up to a certain point. Beyond that point, a
further increase in memory bandwidth does not affect MM’s performance, as the application
has enough latency tolerance, and its behavior resembles that of DMR. Therefore, by
figuring out the suitable V/F for the memory, one could potentially reduce its energy consump-
tion without hurting the performance. Because the abundantly available thread-level parallelism
(TLP) in GPUs provides latency tolerance, reducing the memory frequency should not affect the
application performance significantly as long as there are enough threads to keep the cores busy.
Thus, from a performance angle, voltage/frequency scaling at the memory level can be a suitable
knob for compute-intensive applications.
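The leverage of this knob comes from the standard CMOS power relations (a textbook approximation we add for context, not a formula from this thesis): dynamic power grows with the square of the supply voltage, so lowering memory voltage together with frequency reduces dynamic power superlinearly, while a latency-tolerant kernel pays little performance penalty:

```latex
P_{\mathrm{dyn}} \approx a \, C_{\mathrm{eff}} \, V_{dd}^{2} \, f
\quad\Rightarrow\quad
\frac{P'_{\mathrm{dyn}}}{P_{\mathrm{dyn}}} \approx
\left(\frac{V'_{dd}}{V_{dd}}\right)^{2} \frac{f'}{f}
```

where $a$ is the activity factor and $C_{\mathrm{eff}}$ the effective switched capacitance.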
Figure 3.3. Energy consumption of select benchmarks with respect to the baseline. Each bar is normalized with respect to the energy consumption of the corresponding application in our base configuration (i.e., without any energy optimization). (Bars: Optimal-Config-For-Application and Optimal-Kernel-Based; benchmarks: SSSP, BFS, SP, Stencil, HIST, MST, PVC, Kmean.)
Take-away points: Taking all these factors into account, we conclude that (1) memory-
intensive applications can underutilize the computing cores, and (2) compute-intensive applica-
tions can underutilize the available memory bandwidth. Therefore, finding the ideal number
of active cores for memory-intensive applications and the ideal memory frequency/voltage for
compute-intensive applications is imperative in order to achieve a system with lower energy
consumption.
3.2 Quantifying Resource Underutilization in GPUs at the Kernel Granularity
For a more detailed analysis of the effects of reconfiguring hardware resources on application
performance and energy consumption, we investigate our applications at a finer granularity, i.e.,
at the kernel level. Each GPGPU application consists of one or more kernels, each of which
is launched one or more times during the execution of that application. Different kernels that
belong to the same application might exhibit a large variance in their resource demands, leading
to different resource utilizations across different phases of the application.
Figure 3.3 shows the normalized energy consumption of 8 memory-intensive applications with
respect to the baseline. First, for each application, we varied the number of cores between 1 and
Figure 3.4. Effect of increasing the number of cores on the performance of different kernels from MST.
32, and executed them with a different core count, fixed throughout the whole execution. The
first bar in the figure shows the lowest energy consumption (while maintaining approximately
the same performance as the baseline). Next, we varied the number of cores for each individual
kernel, but observed that the optimal core count that provides the lowest energy consumption
for each kernel is different. The second bar shows the energy consumption when each kernel is
executed with its corresponding optimal core count (again, under the same performance as in
the base case).
One can see from these results that, for four applications (HIST, MST, PVC, Kmean), the
optimal configuration at the kernel granularity provides lower energy consumption than that
of the optimal configuration at the application granularity. This is because these applications
consist of multiple kernels, and each kernel has a different resource requirement. On the other
hand, there is no difference between application-level optimization and kernel-level optimization
for the applications that have only one kernel, executed many times. This figure shows that (1) we
can reduce energy consumption without hurting performance by changing the number of cores,
and (2) modulating the number of cores at the kernel-granularity is better than doing so at the
application-granularity. We are not aware of any application-based approach that dynamically
determines the best GPU configuration; the results reported in Figure 3.3 for the application-based
approach were collected by analyzing statistical data and represent the best hypothetical
application-based approach.
Figure 3.5. The two most important kernels in PVC that exhibit different bandwidth utilizations.
Next, we examine two applications with multiple kernels, MST and PVC, in more detail
and explain why the kernel-level optimum provides better energy savings than the application-
level optimum. Figure 3.4 shows the effect of the number of cores on the performance of four
different kernels from MST. Each of these kernels is executed many times during the course
of execution of this application. Considering the performance scalability of a kernel as the
number of cores is varied, Kernel-4 can be classified as compute-intensive while the other three
kernels are relatively memory-intensive. Moreover, each of these memory-intensive kernels has a
different saturation point where the performance does not increase with more compute resources.
Similarly, Figure 3.5 shows two different kernels of PVC. We observe that, from the perspective
of memory bandwidth utilization, Kernel-2 can be considered as compute-intensive and Kernel-1
as memory-intensive. Figure 3.4 and Figure 3.5 both point to the fact that it might not always be
the case that an application exhibits a similar behavior in terms of its resource demands during
its entire execution. Instead of classifying applications as memory- or compute-intensive based
on the whole execution, doing so based on kernel-granularity information might give us
a more accurate picture of their behavior. In fact, we observe that some GPGPU applications
have both compute-intensive and memory-intensive kernels. Therefore, in order to reduce the
energy consumption of an application, core shut-down may be the better choice for one kernel,
while memory DVFS may be the better choice for another.
Take-away point: Due to significantly different resource demands of different kernels of an
Figure 3.6. IPC of kernels over different invocations. Different invocations of the same kernel provide very similar performance.
application, hardware reconfiguration strategies should operate at a kernel-granularity.
3.3 Kernel-level Properties of GPGPU Applications
By analyzing the launched kernels during the execution of different applications, we observe two
important properties: First, most of the kernels are launched multiple times. Second, those ker-
nels that are launched multiple times exhibit a very consistent behavior (IPC, DRAM bandwidth
utilization, etc.) in different invocations. Figure 3.6 shows the stability of select kernels across
different invocations. We collect IPC of kernels that are executed multiple times and normalize it
to the median. It can be seen that, for each kernel, its IPC in different invocations is distributed
within a 2% distance from the median IPC. Based on this figure, we can safely assume that a
kernel would exhibit similar behavior in its different invocations. We will talk about this margin
more in Section 4.1. Note also that, we observed similar stability for the memory bandwidth
utilization of a kernel in different executions. It should also be mentioned that, across all the
applications we analyzed in this work, we encountered two applications which potentially benefit
from using fewer active SMs, but the target kernel does not show a stable IPC over different
launches. In Section 4.3.2, we further analyze such outlier scenarios, and explain how we handle
them.
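The stability property measured in Figure 3.6 can be expressed as a small check (an illustrative sketch; the function name and the tolerance parameterization are ours, not part of the proposed hardware):

```python
# A sketch of the stability check behind Figure 3.6 (function name and the
# 2% tolerance parameterization are ours, for illustration).
import statistics

def is_stable(ipcs, tolerance=0.02):
    """Return True if every invocation's IPC lies within `tolerance`
    (2% by default) of the kernel's median IPC across invocations."""
    med = statistics.median(ipcs)
    return all(abs(ipc - med) / med <= tolerance for ipc in ipcs)

# A kernel with consistent behavior over different launches:
print(is_stable([100.0, 101.2, 99.4, 100.5]))  # -> True
# An outlier kernel whose IPC varies across launches (Section 4.3.2):
print(is_stable([100.0, 120.0, 85.0]))         # -> False
```

A kernel passing this check can safely reuse a configuration tuned during its first few invocations; one failing it falls under the outlier handling of Section 4.3.2.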
Take-away point: Motivated by these two important observations, the kernel-granularity
information that we can use in our strategy can be collected from that kernel’s first few executions.
We can use this information to figure out what the optimal configuration (number of active SMs
and memory V/F) for that kernel is. After finding this optimal configuration, we can execute
the remaining invocations using the optimal configuration.
3.4 Classifying GPGPU Kernels Dynamically at Runtime
Each GPU thread has a certain memory bandwidth demand. As we increase the number of
SMs, the number of concurrently-running threads increases, and as a result, we have a higher
bandwidth demand. To investigate the effect of increasing compute resources on the memory
system, we monitor the memory bandwidth utilization (MBU), which is equal to the number of
DRAM cycles in which a read or write request is served, divided by the total number of DRAM cycles. Intuitively, we
expect to see performance improvement as we increase the number of SMs, as long as we do not
saturate the available memory bandwidth (theoretically, MBU = 100%). However, we observed
in our experiments that the performance of the memory-intensive kernels gets saturated when
MBU is much less than 100%. For instance, in application SP, we observe that using all the SMs
leads to about 55% memory bandwidth utilization. If we keep reducing the number of active SMs
down to 11, we still observe the same IPC and MBU. Note that, the statistical model reported
in [3] may not capture this behavior because it assumes that, as long as MBU is less than 100%,
increasing the number of SMs will improve the performance. Similarly, we studied the impact
of memory V/F scaling on a wide variety of GPGPU applications, and noticed that only kernels
with very low MBU are not affected negatively by scaling.
To find the optimal hardware configuration for a kernel, we need an approach to identify whether a
kernel is memory- or compute-intensive dynamically at runtime. Based on our preliminary
experiments, we find that 40% memory bandwidth utilization is a reasonable threshold to classify
a kernel as memory-intensive. Therefore, if the MBU of a kernel is above this number, we consider
that kernel as memory-intensive, meaning that this kernel is potentially using too many SMs.
Similarly, based on our preliminary experiments on memory V/F scaling, we consider a kernel
with memory bandwidth utilization less than 10% to be compute-intensive, meaning that we can
potentially employ DVFS on memory without affecting performance significantly.
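This classification rule can be sketched as follows (a minimal illustration; the constant and function names are ours, while the 40%/10% thresholds are the ones chosen in the text):

```python
# A minimal sketch of the runtime classification rule (constant and function
# names are ours; the 40%/10% thresholds are the ones chosen in the text).
MEMORY_INTENSIVE_MBU = 0.40   # above this, the kernel saturates memory
COMPUTE_INTENSIVE_MBU = 0.10  # below this, memory DVFS is safe to apply

def classify_kernel(busy_dram_cycles, total_dram_cycles):
    """Classify a kernel from one profiled invocation."""
    mbu = busy_dram_cycles / total_dram_cycles
    if mbu > MEMORY_INTENSIVE_MBU:
        return "memory-intensive"   # candidate for core shut-down (CSS)
    if mbu < COMPUTE_INTENSIVE_MBU:
        return "compute-intensive"  # candidate for memory V/F scaling (MSS)
    return "balanced"               # keep all SMs and the highest memory V/F

# Example: SP reaches ~55% MBU with all SMs active -> memory-intensive.
print(classify_kernel(55, 100))  # -> memory-intensive
```

Kernels in the middle band (10% < MBU < 40%) utilize both resources in a balanced way and keep the default configuration.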
Sensitivity: Note that, the main goal of our classification is just to identify kernels that are
amenable to core turn-off and memory scaling. We found our two thresholds (40% and 10%) to
be very accurate at capturing memory- and compute-intensive kernels. In fact, in our experiments, we
never encountered a kernel with an MBU of less than 40% that could gain from using fewer
active SMs. Besides, we also tested the 50% and 60% thresholds for detecting memory-sensitive
kernels. We observed that, if we use a 50% MBU threshold, we lose some of the opportunities
that could have been exploited to save energy in the HIST and PVC applications. On the other
hand, if we use a 60% MBU threshold, we would lose most of the opportunities across all the
applications. Therefore, by using 40% memory bandwidth utilization as our threshold, we
could detect all the kernels that should potentially use fewer active SMs. Similarly,
our sensitivity experiments showed that 10% bandwidth utilization is a good threshold to tag a
kernel as compute-intensive.
Take-away point: We can classify a kernel as compute- or memory-intensive at the end of
its first execution. This information can be used to modulate the number of cores or memory
V/F to reduce the energy consumption of the system.
Chapter 4
Kernel Based GPU Resource Management
In this section, we describe our strategy to find the ideal SM count and the memory V/F con-
figuration for each kernel. The first time a kernel is launched by the CPU, we allocate all the
available SMs to that kernel and set the memory to its highest V/F. When the execution of this
kernel is over, we collect the IPC and MBU of that kernel. If the kernel is recognized as
memory-intensive, we employ our Core-Side Energy Optimization Scheme (CSS) to find the ideal number
of active SMs for that kernel, and if it is classified as compute-intensive, we use our Memory-Side
Energy Optimization Scheme (MSS) to figure out the ideal memory V/F. Otherwise (i.e., 10% <
MBU < 40%), we use all the cores and highest memory V/F for the future invocations, because
the kernel utilizes both compute and memory resources in a balanced way.
4.1 CSS: Core-Side Energy Optimization Scheme
The goal of CSS is to employ a core shut-down mechanism to reduce the energy consumption
of memory-intensive kernels. The key idea behind CSS is to monitor IPC and MBU statistics
at the end of the first few invocations of a memory-intensive kernel to find the ideal number of
active SMs. To achieve this, we propose a binary search over different numbers of active SMs across
multiple launches. Initially, we already have the IPC and MBU of the kernel when it has
been allocated all the SMs (its first invocation).
The second time the same kernel is launched, it will be allocated only half of the available
SMs and then, based on the IPC and MBU of this execution, we will keep performing the
binary search over the next few launches of that kernel to finally reach to the ideal number of
active SMs. Figure 4.1 shows a sample binary search over five executions. The binary search
takes log(Number-of-SMs) steps to find the ideal answer. Our base architecture has 32 SMs;
consequently, after 5 steps we will reach to the ideal point.
In this search process, we compare the IPC of the new configuration with the IPC of the
very first execution, which had all the SMs activated. When we compare two IPCs from two
different launches, we accept a small window of error: in practice, as long as
|IPC2 − IPC1| ≤ α, we consider them the same. Such a comparison works well because, as soon
as we cross back over the saturation point and step into the linear part of the execution
(Figure 3.1), the newly observed IPC will be considerably less than IPC1.
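The search over SM counts can be sketched as follows (a minimal sketch: `measure_ipc` is a hypothetical stand-in for one real kernel launch with a given SM count, and `alpha` is the IPC-comparison tolerance described above):

```python
def search_ideal_sm_count(measure_ipc, total_sms=32, alpha=2.0):
    """Binary-search the smallest SM count whose IPC matches the
    all-SMs IPC to within alpha. In the proposed design each probe is a
    real kernel launch; here `measure_ipc(n)` is an illustrative hook.
    """
    ipc_full = measure_ipc(total_sms)      # first invocation: all SMs
    lo, hi = 1, total_sms
    ideal = total_sms
    while lo <= hi:                        # about log2(total_sms) probes
        mid = (lo + hi) // 2
        if abs(measure_ipc(mid) - ipc_full) <= alpha:
            ideal = mid                    # still in the saturated region
            hi = mid - 1                   # try even fewer SMs
        else:
            lo = mid + 1                   # crossed into the linear region
    return ideal
```

For a kernel whose performance saturates at 11 SMs (e.g., IPC = min(10n, 110)), the search settles on 11 active SMs.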
To the best of our knowledge, current GPUs do not support a per-core power-gating
mechanism. However, judging from the multicore domain [18], we expect that future GPGPUs
with more CUDA cores will support fine-grain power gating [3, 4]. Considering this, our
goal is to find the ideal number of active SMs and use the power-gating mechanism to power off
the rest of the SMs. This way, we can reduce the overall power consumption of the GPU and
improve its energy efficiency. Note that, in GPUs, static/leakage power contributes a
considerable portion of the total power consumption. In fact, Leng et al. [5] report that,
for an NVIDIA GTX 480 card, static power consumption is approximately 59 W, most of which
belongs to the computing cores. As technology nodes continue to shrink and the number of
CUDA cores on a GPU card keeps increasing, leakage power will become an even more important
Figure 4.1. Performing binary search on the (a) number of SMs and (b) memory frequency/voltage.
factor in total power consumption [1, 19]. Hence, by finding the ideal number of active SMs, not
only can we achieve lower dynamic power consumption, but we can also reduce leakage power
via per-SM power gating.
4.2 MSS: Memory-Side Energy Optimization Scheme
The algorithm for finding the ideal memory V/F is very similar to the one described for finding
the ideal number of active SMs. If we recognize a kernel as compute-intensive after its very
first execution, we employ a binary search to find the proper memory V/F over the next launches
of that kernel. As we have 7 different voltage/frequency pairs, it takes 3 steps to reach the
ideal point (Figure 4.1). Note that, for memory-intensive kernels, a considerable portion of the
total dynamic power is due to memory transactions, whereas in compute-intensive kernels the
memory system contributes only a small portion of it. Therefore, for compute-intensive
kernels, scaling down the memory V/F does not affect the overall dynamic power consumption
much; the goal behind such scaling is instead to reduce the memory leakage power
during those underutilized phases.
We assume 7 P-states for our target memory, ranging from about 500 MHz to 2 GHz [5, 6].
We use 45 nm predictive technology models [20] to scale the voltage with frequency. Based
on these parameters, we scale the voltage from 1 V to 0.55 V. Based on our definition of
compute-intensiveness, we scale the memory V/F only if the number of memory transactions is
low, since our focus is to reduce energy consumption while incurring negligible performance
loss. Our simulations show that, for kernels with bandwidth utilization over 10%, scaling the
memory V/F can considerably affect performance, which is not acceptable in our work. Note
that DVFS can also be applied to memory at a finer granularity, which is out of the scope
of this paper [12, 13].
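Under the assumption of equal voltage steps between 0.55 V and 1 V (Section 5.1) and correspondingly spaced frequencies, the P-state table can be sketched as follows (the exact voltage/frequency pairing in the thesis comes from the 45 nm predictive technology models, so the linear pairing here is an illustrative simplification):

```python
def memory_p_states(n=7, v_min=0.55, v_max=1.0, f_min=500e6, f_max=2e9):
    """Build the memory P-state table: n voltage/frequency pairs with
    equal voltage intervals between v_min and v_max, and evenly spaced
    frequencies between f_min and f_max (an assumed linear pairing).
    """
    states = []
    for i in range(n):
        v = v_min + i * (v_max - v_min) / (n - 1)
        f = f_min + i * (f_max - f_min) / (n - 1)
        states.append((round(v, 3), f))   # (volts, hertz)
    return states
```

With 7 P-states, a binary search over the table needs only 3 probes, matching the 3 steps quoted above.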
4.3 Handling Outlier Kernels
We face two types of outlier kernels among different applications. The first category consists
of kernels that are executed only once, which can mostly be detected at compile time. The
second category includes kernels that are executed multiple times but do not exhibit consistent
behavior over those invocations. Such inconsistency is detected by monitoring the MBU of each
execution. In this section, we explain how our proposed CSS and MSS schemes handle these two
types of outliers.
4.3.1 Kernels that are Executed Only Once (Outlier1-Scheme):
For such kernels, we employ sampling within the kernel execution period and treat each sampling
window as if it were a kernel execution. This way, we can collect the required statistics and
perform our binary search over the initial sampling periods. We can potentially use two types
of sampling: static-interval and dynamic-interval. The static approach assumes a fixed number
of cycles as the sampling period for all kernels. In the dynamic approach, on the other hand, the
sampling period is set dynamically, based on the behavior of each kernel. We noticed that, in
the static approach, the sampling period cannot be short if it is to accurately capture the
behavior of the kernel.
To illustrate this point, we selected two applications (MUM and LIB) from this group and, by
analyzing their statistics, observed that, with a long enough sampling period, we can apply
our techniques (core shut-down and memory V/F scaling) to such kernels as well. In analyzing
these two applications, we used a 100K-cycle sampling window, which gives us consistent
behavior over each window. The dynamic-interval approach, on the other hand, is applicable to
kernels that have many blocks/CTAs to execute. For such kernels, we launch as many CTAs as
possible on the SMs (Num-SMs * Max-CTA-Per-SM) and, when they finish, launch the next set of
CTAs. We can then use every set of launched CTAs as a sampling period. The intuition behind
this approach is the similar behavior exhibited by the CTAs, which gives us consistent behavior
over those intervals. Of LIB and MUM, only MUM has enough CTAs to use the dynamic-interval
strategy.
We could also use this approach when a kernel is executed multiple times, in order to find the
ideal point faster; however, using the whole kernel execution for monitoring its behavior lets
us make a more accurate decision.
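The two sampling modes can be sketched as follows (an illustrative sketch; the 100K-cycle static window matches the one used for MUM and LIB, while the dynamic mode takes precomputed per-wave cycle counts as a simplifying assumption):

```python
def sampling_windows(total_cycles, mode="static", window=100_000,
                     cta_wave_cycles=None):
    """Split a single kernel launch into sampling windows that stand in
    for separate invocations (Outlier1-Scheme). 'static' uses a fixed
    cycle window; 'dynamic' uses one window per wave of concurrently
    launched CTAs, whose durations are given as a list here.
    Returns the end cycle of each window.
    """
    if mode == "static":
        return list(range(window, total_cycles + 1, window))
    # dynamic: one window per CTA wave
    bounds, t = [], 0
    for cycles in cta_wave_cycles:
        t += cycles
        bounds.append(t)
    return bounds
```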
Overhead: An important question is how we can shut down some of the SMs in the middle of a
kernel's execution without incurring the performance/power overhead of migration/context-
switching among SMs. To address this, when we perform the binary search over successive
sampling periods, we pause SMs instead of shutting them down. After finding the ideal number
of active SMs, we do not assign any more CTAs to the SMs identified to be turned off; instead,
we keep them paused. When the other active SMs finish their CTAs, we can turn them off and
restart the paused ones. Such a pausing approach does not cause any migration/context-switch
overhead among SMs.
Figure 4.2. Transitioning between the compute-intensive (CI) and memory-intensive (MI) states in PVC.
4.3.2 Kernels with Different Behavior over Different Launches (Outlier2-Scheme):
Among the benchmarks in our experimental suite, we found only two applications (PVC and
PVR) that can potentially use fewer active SMs for some kernel launches but do not have
similar IPCs over different launches. However, for this type of application, we observe a
quite stable MBU over consecutive launches. In more detail, we noticed that such programs
have multiple nested loops, which results in many successive calls to one kernel, then a
sequence of successive calls to another kernel, and so on (as can be seen in Figure 4.2). We
observe similar behavior within each sequence of successive calls to the same kernel, but two
different sequences of calls (to the same kernel) behave differently in terms of IPC. These
applications appear to have a consistent MBU over successive launches within each loop
iteration, which motivates us to use this metric as our knob instead of IPC.
We observe that some of these kernels exhibit memory-intensive behavior for tens of successive
launches and then switch to a compute-intensive state. Figure 4.2 plots such transitions for
kernel-2. Therefore, when we optimize the number of SMs for memory-intensive periods, we need
to keep monitoring the MBU. If we observe a sudden decrease in MBU, we are facing a transition
to the compute-intensive state and need to activate all the SMs and potentially
reconfigure the memory system to a lower V/F. A similar scenario can occur when transitioning
back from the compute-intensive to the memory-intensive state. In Figure 4.2, kernel-1 is
always in the memory-intensive mode. Kernel-2, on the other hand, switches between the memory-
and compute-intensive states, which is captured by monitoring its MBU.
We observed that, if a kernel is in the saturated mode, reducing the number of SMs does not
affect the memory bandwidth utilization as long as we remain in the saturated region. Knowing
this property, we can use the MBU instead of the IPC as our knob for this type of application,
meaning that we can keep reducing the number of cores as long as we observe the same bandwidth
utilization. We also noticed that, for this class of applications, the threshold we defined
for recognizing memory-intensive kernels is very accurate. Therefore, as long as the MBU does
not drop below that threshold as we decrease the number of SMs, we do not lose noticeable
performance.
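The MBU-driven reduction for Outlier2 kernels can be sketched as follows (a sketch that halves the SM count per probe; `measure_mbu` is a hypothetical per-launch measurement hook, and the 2% equality window is an assumed tolerance):

```python
def reduce_sms_by_mbu(measure_mbu, total_sms=32, mi_threshold=0.40):
    """For Outlier2 kernels, IPC is unstable across launches, so MBU is
    the knob: keep halving the SM count while the observed MBU stays at
    the saturated level and above the memory-intensive threshold.
    """
    mbu_full = measure_mbu(total_sms)
    n = total_sms
    while n > 1:
        candidate = n // 2
        mbu = measure_mbu(candidate)
        # same bandwidth utilization -> still saturated; keep reducing
        if mbu >= mi_threshold and abs(mbu - mbu_full) < 0.02:
            n = candidate
        else:
            break
    return n
```

For a kernel whose MBU holds at 55% down to 8 SMs and then drops, the sketch stops at 8 active SMs.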
4.4 Putting it Together
Algorithm 1 presents our high-level strategy. The IdealConfig function returns the ideal
number of SMs and the ideal memory V/F for a kernel. The Iteration function returns the
configuration of the GPU at each step of CSS/MSS. After the execution of the kernel, the
UpdateStatus function collects the required statistics. All this logic is implemented in the
Logic-Unit shown in Figure 4.3. As can be seen in Algorithm 1, we can either ignore outliers
and let them run under the base configuration or use the proposed schemes to improve them as
much as possible.
4.5 Microarchitecture and Hardware Overhead
Figure 4.3 illustrates a high-level view of the proposed architecture. After the execution of a
kernel, the logic unit collects the required statistics and determines the next step in the
binary search process. After reaching the ideal configuration, the logic unit sets the number
of active
Algorithm 1 Pseudo-code representing the high-level strategy

if (KernelID.IdealConfig() = NotFound) then
    if (KernelID.SingleCall()) then
        // Apply Outlier1-Scheme
    else
        GPUConfig -> KernelID.Iteration();
        KernelID.UpdateStatus();
        if (KernelID.Monitor() = Outlier2) then
            // Apply Outlier2-Scheme
        end if
    end if
else
    GPUConfig -> KernelID.IdealConfig();
end if
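A software rendering of this decision logic might look as follows (the dictionary keys mirror the pseudocode functions; this is an illustrative stand-in for the Logic-Unit, not a hardware interface):

```python
def next_config(record, base_config=(32, 6)):
    """One decision step of the high-level strategy for a launched kernel.

    `record` holds per-kernel Logic-Unit state: 'ideal' (stored
    (active_SMs, memory_P-state) pair, or None while still searching),
    'single_call' (Outlier1: kernel launched only once), and 'iteration'
    (the next CSS/MSS binary-search configuration to try).
    """
    if record['ideal'] is not None:
        return record['ideal']      # ideal already found: reuse it
    if record['single_call']:
        return base_config          # Outlier1: sample within the launch
    return record['iteration']      # continue the CSS/MSS binary search
```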
SMs and memory frequency/voltage of the GPU to its ideal configuration for future invocations
of that kernel.
In our approach, we store the IPC and MBU of each individual kernel. These numbers are used
during the binary search, and, after reaching the ideal point, we also need to store the ideal
configuration for future invocations. Therefore, for each kernel we need to store four
variables: the IPC of the first execution (floating point), the memory bandwidth utilization
of the first execution (floating point), the optimal number of active SMs (5 bits), and the
optimal memory frequency/voltage (3 bits). Hence, each kernel needs 9 bytes of storage to keep
its information. In our experiments,
Figure 4.3. Microarchitecture design of the proposed techniques.
we did not encounter an application with more than 20 kernels. Therefore, even if we consider
such a scenario, we need a table with 20 entries, which amounts to 180 bytes. Note that we
need not worry about applications with too many kernels: not all of those kernels will be
launched multiple times, so if we apply a simple LRU policy to manage the contents of this
table, the kernels that are executed only once will be evicted from the table automatically.
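Such a table can be sketched with an LRU-ordered map (a software sketch of the structure described above; the capacity and record layout follow the text, while the class and method names are illustrative):

```python
from collections import OrderedDict

class KernelTable:
    """20-entry per-kernel configuration table with LRU replacement, so
    single-launch kernels age out of the table naturally."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self.entries = OrderedDict()  # kernel_id -> (ipc, mbu, sms, pstate)

    def touch(self, kid, record):
        if kid in self.entries:
            self.entries.move_to_end(kid)     # mark as most recently used
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the LRU kernel
        self.entries[kid] = record
```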
Figure 4.3 shows the microarchitecture design of the proposed dynamic kernel-based
reconfiguration technique. We assume that each SM has 2 counters (overall, 32*2*4 B) to track
the number of executed instructions and the number of core cycles, in order to calculate the
IPC. Likewise, each memory channel needs 2 counters (overall, 6*2*4 B) to track the number of
memory transactions and the number of memory cycles, in order to calculate the MBU. Therefore,
our technique has an overall hardware overhead of 484 bytes. We use per-SM and per-channel
counters to make the design adaptable to the multiple-kernel case explained in Section 5.5.
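The overhead figures above can be reproduced with a few lines of arithmetic:

```python
# Hardware bookkeeping overhead, using the figures from the text.
NUM_SMS, NUM_CHANNELS, COUNTER_BYTES = 32, 6, 4
TABLE_ENTRIES, ENTRY_BYTES = 20, 9  # 2 floats + 5 bits + 3 bits per kernel

sm_counters = NUM_SMS * 2 * COUNTER_BYTES        # instruction + cycle counters
mem_counters = NUM_CHANNELS * 2 * COUNTER_BYTES  # transaction + cycle counters
kernel_table = TABLE_ENTRIES * ENTRY_BYTES       # per-kernel IPC/MBU/config
total = sm_counters + mem_counters + kernel_table  # 256 + 48 + 180 = 484 B
```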
Chapter 5 | Experimental Results
5.1 Methodology
Platform: To evaluate our proposal, we used GPGPU-Sim v3.2.1 [24]. The details of the
simulated configuration are listed in Table 5.1; it is similar to the GTX 480 configuration.
In our experiments, we varied the number of active SMs between 1 and 32, and used 32 in our
baseline. We also used 7 different memory voltage values, between 0.55 V and 1 V, with equal
intervals. Corresponding to these voltage values, we used 7 different memory frequency values,
between 500 MHz and 2 GHz. Our baseline uses 1 V and 2 GHz.
Benchmarks: Table 5.2 lists the applications we used in our evaluations. We consider a wide
range of applications from various benchmark suites: CUDA SDK [25], Rodinia [26], Parboil [27],
Mars [28], Shoc [29], and LonestarGPU [30]. We classify these applications/kernels as compute-
intensive (Type-C), memory-intensive (Type-M), or outlier (Type-O).
Performance Metrics: In this work, we focus on energy efficiency, so we report three metrics.
First, we report application performance in terms of IPC, normalized with respect to the
baseline configuration described in Table 5.1. Second, we report the power consumption of the
system using GPUWattch [5]. In particular, we focus on dynamic power, leakage power, and DRAM
Table 5.1. Baseline configuration. We normalize the results with respect to the number of cores and memory voltage/frequency configuration denoted in bold.

SM Config.: 1-32 Shader Cores, 1400 MHz, SIMT Width = 32
GPU Resources / Core: Max. 1536 Threads (48 warps, 32 threads/warp), 48 KB Shared Memory, 32684 Registers
Caches / Core: 16 KB 4-way L1 Data Cache, 12 KB 24-way Texture, 8 KB 2-way Constant Cache, 2 KB 4-way I-cache, 128 B Line Size
L2 Cache: 128 KB/Memory Partition, 128 B Line Size, 8-way, 700 MHz
Default Warp Scheduler: Greedy-then-oldest [21]
Features: Memory Coalescing, Inter-warp Merging, Immediate Post Dominator [22]
Interconnect: Crossbar, 1400 MHz, 32 B Channel Width
Memory Model: 6 GDDR5 MCs, 500 MHz-2 GHz, 0.55-1 V, FR-FCFS [23], 8 DRAM banks/MC
GDDR5 Timing: tCL = 12, tRP = 12, tRC = 40, tRAS = 28, tCCD = 2, tRCD = 12, tRRD = 6, tCDLR = 5, tWR = 12
Table 5.2. List of GPU benchmarks. Type-M: applications with MBU > 40%. Type-O: outlier applications. Type-C: applications with MBU < 10%. In the last column, M and C refer to memory- and compute-intensive kernels.

#   Suite     Application                  Abbr.    Type    Kernel
1   Lonestar  Single-Src Shortest Paths    SSSP     Type-M  M
2   Lonestar  Breadth-First Search         BFS      Type-M  M
3   Lonestar  Survey Propagation           SP       Type-M  M
4   Lonestar  Minimum Spanning Tree        MST      Type-M  M-C
5   Parboil   Saturating Histogram         HIST     Type-M  M-C
6   Shoc      2D Stencil Computation       stencil  Type-M  M
7   SDK       MUMerGPU                     MUM      Type-O  M
8   SDK       LIBOR Monte Carlo            LIB      Type-O  M
9   Mars      Kmeans Clustering            Kmean    Type-O  M-C
10  Mars      Page View Count              PVC      Type-O  M-C
11  Lonestar  Delaunay Mesh Refinement     DMR      Type-C  C
12  Rodinia   Cardiac Myocyte              MYO      Type-C  C
13  Shoc      Sparse Matrix Vector Multi.  SPMV     Type-C  M-C
14  Shoc      Lennard-Jones Potential      MD       Type-C  C
15  Parboil   Matrix Multiplication        MM       Type-C  C
16  Mars      Page View Rank               PVR      Type-C  M-C
power. Third, based on the performance and power results, we calculate the energy consumption
of the system. The results presented below include all the runtime overheads introduced by our
approach.
Figure 5.1. Stabilization of the binary search (ideal number of SMs) for three different kernels from the HIST application (MBU-1 ≈ 55%, MBU-2 ≈ 50%, MBU-3 ≈ 20%).
Figure 5.2. Stabilization of the binary search (ideal memory V/F) for the two main kernels from the PVC application (MBU-1 ≈ 55%, MBU-2 ≈ 5%).
5.2 Dynamism of the CSS and MSS Techniques
Figure 5.1 shows how our CSS technique stabilizes after five kernel invocations. As can be
seen, these three kernels from the HIST application exhibit different SM-count demands. Among
them, the MBU of kernel-3 is below the defined threshold for memory intensiveness, and it
effectively exploits all 32 SMs. The MBUs of kernel-1 and kernel-2, on the other hand, are
above that threshold, and CSS allocates them far fewer SMs without incurring any performance
loss. Figure 5.2 shows how our MSS technique stabilizes after three steps. As can be observed,
these two kernels from PVC have different memory utilizations. The MBU of kernel-1 is higher
than the threshold for compute-intensive kernels, so our technique does not try to change the
memory V/F for this kernel. The MBU of kernel-2, on the other hand, is less than
that threshold and our technique finds its ideal memory V/F after three steps.
5.3 Evaluation of CSS
Figure 5.3 shows the dynamic power consumption breakdown for the memory-intensive kernels of
different GPGPU applications in the base configuration (i.e., without any energy optimization).
For each of these kernels, memory power consumption contributes a considerable portion of the
total dynamic power consumption.
Figure 5.3. Dynamic power consumption breakdown across memory-intensive kernels (baseline, without any energy optimization). Components: SM caches, shared memory, register file, cores, L2/MC/DRAM, and NoC.
Static and Dynamic Power Consumption: Figure 5.4 reports the portions of the total energy
corresponding to static and dynamic power consumption. As can be seen, for some applications
only the static power consumption is improved. As we decrease the number of active SMs, for
kernels running in the saturated region, we potentially decrease the dynamic power consumption
in two ways. First, the kernel experiences less resource contention during execution [2].
Second, if the kernel had already incurred some performance loss, it experiences a shorter
execution time when using only the ideal number of SMs. As can be seen in Figure 5.7, the SSSP
and BFS applications, which had considerable performance degradation in the base
configuration, gain the most in terms of dynamic power saving.
On the other hand, after finding the ideal number of SMs, we power-gate the idle ones.
Therefore, during the course of execution, we have fewer active SMs, which helps in
lowering the static power consumption. The energy saving reported for static power consumption
in Figure 5.4 is therefore linearly dependent on the number of power-gated SMs.
It should also be noted that, among these applications, our technique appears to be less
effective on MST and Stencil. Figure 5.5 shows the performance of these two applications as we
vary the number of SMs from 10 to 48 (note that we have only 32 SMs in our base configuration).
As can be observed, MST and Stencil enter the saturated region beyond 30 SMs. The reason our
technique is less effective for these two applications is that, with around 32 SMs, they have
only just crossed the saturation point; therefore, the scope for reducing energy consumption
is limited.
Figure 5.4. Energy saving gained by using the optimal number of SMs and power-gating the remaining SMs (normalized dynamic and static energy for the Base, CSS, and Optimal configurations of each application, plus the harmonic mean).
Figure 5.5. Analyzing the saturated region for MST and Stencil (normalized IPC as the number of cores varies).
Performance: Figure 5.6 gives the average number of SMs at which CSS stabilized as the ideal
point, compared with the optimal number of SMs calculated offline. Figure 5.7 shows
Figure 5.6. Average number of cores for each application under the base, our, and the optimal configurations.
the impact of our technique on performance. As can be observed, the first four applications
experience severe performance loss if we allocate them all the available SMs. Our technique
improves the performance of these applications by 12% on average. In addition to the memory
bandwidth saturation problem, these applications also suffer from high resource contention
among concurrently running threads. We observed that, in the case of SSSP and BFS, the L2
cache miss rate increases from 32% and 45% (for the optimal number of SMs) to 78% and 77% (for
32 SMs), respectively. Our CSS technique degrades the performance of the remaining
applications by less than 2%.
Figure 5.7. Normalized IPC values for different applications with respect to the baseline.
Outlier Kernels: MUM and LIB belong to the first class of irregular applications
(Section 4.3.1), where a kernel is executed only once. PVC and Kmean, on the other hand,
belong to the second class of outlier applications (Section 4.3.2).
Note that, in this paper, our focus is on kernels that are executed multiple times; alongside
that, we explained how our techniques can be applied to the outliers. In this regard, kernels
that will be executed only once can be recognized at compile time, while abrupt transitions in
MBU tell us that we are facing the second class of outliers. After recognizing the outliers,
we can either ignore them or apply the proposed customized techniques to them.
Summary of Results: Based on the experimental results, CSS achieves up to 35% and on average
(harmonic mean) about 20% energy saving, which is within 8% of the optimal saving (Figure 5.4).
This technique improves the performance of 4 applications by 12% on average, and leads to a
performance loss of 2% on average for the other applications.
5.4 Evaluation of MSS
Since our goal is to optimize energy consumption without incurring performance loss, we do not
attempt to employ any DVFS technique on memory unless the kernel has few memory transactions.
Figure 5.8 shows the effect of employing DVFS on memory for different GPGPU applications. Note
that, over all these applications, the MSS technique is as accurate as the optimal
configuration. In compute-intensive kernels, memory power consumption contributes only a small
portion of the total dynamic power consumption because there are not many memory transactions.
For such kernels, the main portion of memory power consumption is leakage power, which can be
reduced by lowering the voltage. The impact of such a DVFS technique on total power
consumption is modest, but from the memory point of view it is a considerable gain. As can be
seen in Figure 5.8, the first three applications have the highest reduction in leakage power.
These applications have very few memory transactions, and our technique scales the memory V/F
to the lowest available point.
Transitions between compute- and memory-intensive states: The last two bars (PVR and PVC) in
Figure 5.8 exhibit transient behavior over time, and MSS can apply DVFS to memory only during
compute-intensive phases. Therefore, the MSS technique has less oppor-
Figure 5.8. Normalized memory leakage power for different applications with respect to the baseline.
tunity to reduce memory leakage power for these two applications. As can be seen in
Figure 4.2, the MBU of kernel-1 from PVC is always high, so we cannot apply the memory DVFS
technique to it. The MBU of kernel-2, on the other hand, is below 10% during the CI phase and
above 50% during the MI phase. Our algorithm dynamically detects such transitions between
compute- and memory-intensive phases. For instance, during the CI phase in Figure 4.2,
kernel-2 is assigned all 32 SMs and its memory voltage is reduced to half, while in the MI
phase its memory V/F is set to the highest level and the number of SMs is set to 25.
5.5 Running Multiple Applications Concurrently
In this subsection, we discuss how our approach works in a multi-application setting. Most
prior works on multi-kernel execution [1, 31, 32] assign different kernels to different SMs,
assuming a half-and-half partitioning of the SMs among kernels. Following this approach, our
technique monitors the MBU of each individual kernel as well as the overall MBU, in order to
separately find the ideal number of SMs for each kernel. After finding the ideal number of SMs
for a memory-intensive kernel, this approach might decide to assign more SMs to the other
kernel if it is recognized as compute-intensive. We can also decide to shut down some of the
SMs if our scheme detects that none of the concurrently executing kernels will benefit from
the additional SMs. Note that the microarchitecture reported in Section 4.5 already supports
this mechanism. Our memory-side technique, on the other hand, is based on memory-related
statistics; not based
on the number of concurrently executing kernels: we apply DVFS to memory only if we are sure
that none of the concurrently running kernels is memory-intensive. As an example, we picked a
memory-intensive application (GUPS) and a relatively compute-intensive one (JPEG) and ran them
together on a 32-core system, assuming an even partitioning of the SMs across the
applications. We observed that the MBUs of GUPS and JPEG are around 60% and 15%, respectively.
Therefore, our algorithm recognizes GUPS as memory-intensive and, using CSS, finds its ideal
number of SMs, which is about 7 SMs (3% performance loss) instead of 16. Since we tag JPEG as
compute-intensive, we can assign those extra 9 SMs to it, knowing that this will not affect
the overall MBU. As a result, the performance of JPEG improves by 19% without a noticeable
increase in memory traffic.
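The repartitioning decision in this example can be sketched as follows (an illustrative sketch; treating the second kernel as able to absorb the freed SMs whenever it is not memory-intensive is an assumption based on the JPEG example above):

```python
def repartition_sms(mbu_a, mbu_b, ideal_a, total_sms=32):
    """Sketch of the multi-application policy: starting from an even
    split, give kernel A (memory-intensive) only its ideal SM count and
    hand the freed SMs to kernel B if B can use them.
    Returns the (SMs_for_A, SMs_for_B) allocation.
    """
    half = total_sms // 2
    a_is_mi = mbu_a > 0.40   # A saturates memory bandwidth
    b_can_absorb = mbu_b <= 0.40  # B is not memory-intensive itself
    if a_is_mi and b_can_absorb:
        return ideal_a, total_sms - ideal_a
    return half, half        # otherwise keep the even partitioning
```

Plugging in the GUPS/JPEG numbers (MBUs of 60% and 15%, ideal of 7 SMs for GUPS) yields the 7/25 split described above.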
Chapter 6 | Related Work
Theoretically speaking, assigning more SMs to a highly multi-threaded application improves its
performance as long as the memory bandwidth does not saturate. Huang et al. [33] evaluated the
effect of the number of active SMs on energy consumption and argued that having all the SMs
activated is the most energy-efficient configuration. The shortcoming of that study is that it
did not consider any memory-intensive application. For a more accurate analysis, we need to
consider the possible congestion in the interconnection network and the contention in the
last-level cache caused by the enormous number of memory requests issued by the huge number of
concurrently running threads. Along these lines, Guz et al. [2] showed that increasing
parallelism improves performance as long as the memory access latency is not affected
considerably. Suleman et al. [15] proposed feedback-driven threading to control the number of
concurrently running threads in CMPs, in order to balance the overhead of data synchronization
against the level of parallelism.
There have been prior works, in both the CPU and GPU domains, that determine the optimal
number of cores for a particular application/kernel in order to improve performance or reduce
power consumption. Considering that any bottleneck on the memory side (i.e., congestion in the
interconnection network, contention in the L2 cache, or saturation of the DRAM memory
bandwidth) causes frequent SM stalls, we can categorize the prior works as follows.
Modeling-Based. Li and Martinez [34] analytically estimated the optimal number of processors
to achieve the best EDP in CMPs. In the GPU domain, Hong et al. [3] proposed an analytical
model that predicts the optimal number of SMs offline, based on a kernel's memory demand.
These models do not cover the possible congestion in the interconnection network or the
contention in the last-level cache, which might lead to overestimating the number of required
SMs under such circumstances. Besides, in Section 3, we demonstrated that some kernels exhibit
considerably different behaviors over different invocations as dynamic parameters, such as the
kernel's input, change. Modeling techniques fail to capture such variations in run-time
behavior.
Throttling-Based. Throttling is an approach for managing the degree of concurrency in a
multi-threaded application. Rogers et al. [?] and Kayiran et al. [17] both employ a
warp-throttling technique to handle contention in the L1 cache. Thread-throttling techniques
can reduce the pressure on the memory side; consequently, SMs experience fewer and shorter
stalls because of the smaller data-access latencies. In the current work, we adopt a core
shut-down technique that not only adjusts the degree of concurrency but also reduces the core
static leakage power, a major contributor to the total chip power consumption [5]. Besides, we
employ a throttling-based technique (i.e., the pausing technique) during our search process,
and apply the core shut-down once the ideal number of active SMs is determined.
DVFS-Based. Similar to throttling-based techniques, DVFS-based techniques can be used to
reduce the occurrence and length of SM stalls by lowering the frequency of the SMs, which in
turn reduces the number of memory requests generated per unit of time. Although DVFS
techniques are more effective than throttling in reducing SM leakage power, the core shut-down
approach is even more effective in that sense. Lee et al. [4] analyze the impact of DVFS on
the SMs as well as the interconnection network under a fixed power budget. They demonstrate
that wisely distributing the power budget can improve system performance. Leng et al. [5]
employ a DVFS technique for a different purpose. They
propose to reduce the V/F of the SMs only during SM stall times. Note that core-side DVFS
cannot resolve contention in the L2 cache, because cache contention is a function of the
sequence of accesses, not their timing. Overall, DVFS techniques are orthogonal to throttling:
throttling techniques can reduce the occurrence and length of SM stalls, while a parallel DVFS
technique can be applied during the inevitable ones.
Optimal Memory Voltage/Frequency. There are several prior works on the power consumption
of memory systems. One category tries to reshape the memory traffic in order to keep some
channels/ranks/banks in sleep mode [35, 36]. These techniques are not applicable to GPUs
because, for each individual kernel, we would need to detect and migrate the working data-set,
which makes the overhead of data migration prohibitive. From a different perspective, David
et al. [12] employed a DVFS technique to reduce the power consumption of memory. Following
that work, Deng et al. [13] proposed MultiScale, which applies DVFS to memory systems with
multiple controllers. CoScale [?] proposes a coordinated CPU and memory-system DVFS
technique for servers; this coordination avoids the conflicts that can arise when the CPU and
the memory system have separate DVFS controllers. In contrast to these techniques, our
proposed memory-side DVFS operates at kernel granularity.
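A minimal sketch of such a kernel-granularity controller is shown below. The classification threshold, the number of profiling invocations, and the V/F level names are hypothetical, as is the assumption that per-kernel bandwidth utilization is available from hardware counters; a real implementation would live in the driver or on chip.

```python
# Hypothetical kernel-granularity memory DVFS controller: classify each
# kernel from its first few invocations, then reuse that decision for
# every future invocation of the same kernel.

PROFILE_INVOCATIONS = 3     # invocations used for profiling
BW_THRESHOLD = 0.5          # below this utilization => compute-intensive

class MemoryDVFSController:
    def __init__(self):
        self.samples = {}   # kernel name -> observed bandwidth utilizations
        self.decision = {}  # kernel name -> chosen memory V/F level

    def on_kernel_launch(self, kernel, bw_utilization):
        if kernel in self.decision:
            return self.decision[kernel]          # reuse the cached setting
        self.samples.setdefault(kernel, []).append(bw_utilization)
        if len(self.samples[kernel]) >= PROFILE_INVOCATIONS:
            avg = sum(self.samples[kernel]) / len(self.samples[kernel])
            # Compute-intensive kernels tolerate a slower memory system.
            level = "low_vf" if avg < BW_THRESHOLD else "high_vf"
            self.decision[kernel] = level
            return level
        return "high_vf"                          # safe default while profiling
```

Because the decision is cached per kernel, the profiling cost is paid only on the first few invocations, exploiting the observation that a kernel behaves similarly across its many invocations.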
Chapter 7 |
Conclusions
In this paper, we showed that allocating all the SMs to memory-intensive kernels is not neces-
sarily energy-efficient, nor does it achieve the highest performance. Based on this
observation, we proposed a technique that dynamically determines the ideal number of active
SMs for each kernel and power-gates the remaining SMs. This technique significantly reduces
both dynamic and static power consumption: over a wide variety of GPGPU applications,
it reduced energy consumption by up to 36%, and by about 20% on average. On the other hand,
memory does not necessarily need to be set to its fastest configuration for compute-intensive
kernels. Therefore, we can apply DVFS to memory whenever a compute-intensive kernel
is running. We showed that, by scaling the memory voltage for compute-intensive kernels, we can
reduce memory leakage power by 70% on average, without affecting overall system performance.
Bibliography
[1] NVIDIA, "Kepler: NVIDIA's Next Generation CUDA Compute Architecture." URL http://www.nvidia.com/kepler
[2] Guz, Z., E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. Weiser (2009) "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," Computer Architecture Letters.
[3] Hong, S. and H. Kim (2010) "An Integrated GPU Power and Performance Model," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10.
[4] Lee, J., V. Sathisha, M. Schulte, K. Compton, and N. S. Kim (2011) "Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling," in Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on.
[5] Leng, J., T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi (2013) "GPUWattch: Enabling Energy Optimizations in GPGPUs," SIGARCH Comput. Archit. News.
[6] NVIDIA, "Fermi: NVIDIA's Next Generation CUDA Compute Architecture." URL http://www.nvidia.com/fermi
[7] Lee, S. Y. and C. J. Wu (2014) "Characterizing the Latency Hiding Ability of GPUs," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014, poster.
[8] Abdel-Majeed, M. and M. Annavaram (2013) "Warped Register File: A Power Efficient Register File for GPGPUs," in Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), HPCA '13.
[9] Yuan, G. L., A. Bakhoda, and T. M. Aamodt (2009) "Complexity Effective Memory Access Scheduling for Many-core Accelerator Architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, ACM.
[10] Singh, I., A. Shriraman, W. W. Fung, M. O'Connor, and T. M. Aamodt (2013) "Cache Coherence for GPU Architectures," in High Performance Computer Architecture (HPCA 2013), 2013 IEEE 19th International Symposium on.
[11] Hechtman, B. A., S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood (2014) "QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs," in High Performance Computer Architecture (HPCA 2014), 2014 IEEE 20th International Symposium on.
[12] David, H., C. Fallin, E. Gorbatov, U. R. Hanebutte, and O. Mutlu (2011) "Memory Power Management via Dynamic Voltage/Frequency Scaling," in Proceedings of the 8th ACM International Conference on Autonomic Computing, ICAC '11, ACM.
[13] Deng, Q., D. Meisner, A. Bhattacharjee, T. F. Wenisch, and R. Bianchini (2012) "MultiScale: Memory System DVFS with Multiple Memory Controllers," in Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED '12, ACM.
[14] Gebhart, M., D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron (2011) "Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, ACM.
[15] Suleman, M. A., M. K. Qureshi, and Y. N. Patt (2008) "Feedback-driven Threading: Power-efficient and High-performance Execution of Multi-threaded Workloads on CMPs," SIGARCH Comput. Archit. News.
[16] Williams, S., A. Waterman, and D. Patterson (2009) "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, pp. 65–76.
[17] Kayiran, O., A. Jog, M. T. Kandemir, and C. R. Das (2013) "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in PACT.
[18] Intel, "Next Generation Intel Microarchitecture." URL http://www.intel.com/technology/architecture-silicon/next-gen/
[19] Kim, N. S., T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir, and V. Narayanan (2003) "Leakage Current: Moore's Law Meets Static Power," Computer.
[20] Predictive Technology Model. URL http://ptm.asu.edu
[21] Rogers, T. G., M. O'Connor, and T. M. Aamodt (2012) "Cache-Conscious Wavefront Scheduling," in MICRO.
[22] Fung, W., I. Sham, G. Yuan, and T. Aamodt (2007) "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in MICRO.
[23] Rixner, S., W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens (2000) "Memory Access Scheduling," in ISCA.
[24] Bakhoda, A., G. Yuan, W. Fung, H. Wong, and T. Aamodt (2009) "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on.
[25] NVIDIA (2011), "CUDA C/C++ SDK Code Samples." URL http://developer.nvidia.com/cuda-cc-sdk-code-samples
[26] Che, S., M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron (2009) "Rodinia: A Benchmark Suite for Heterogeneous Computing," in IISWC.
[27] Stratton, J. A., C. Rodrigues, I. J. Sung, N. Obeid, L. W. Chang, N. Anssari, G. D. Liu, and W. W. Hwu (2012) Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing, Tech. Rep. IMPACT-12-01, University of Illinois at Urbana-Champaign.
[28] He, B., W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang (2008) "Mars: A MapReduce Framework on Graphics Processors," in PACT.
[29] Danalis, A., G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter (2010) "The Scalable Heterogeneous Computing (SHOC) Benchmark Suite," in GPGPU.
[30] Burtscher, M., R. Nasre, and K. Pingali (2012) "A Quantitative Study of Irregular Programs on GPUs," in IISWC.
[31] Adriaens, J., K. Compton, N. S. Kim, and M. Schulte (2012) "The Case for GPGPU Spatial Multitasking," in High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on.
[32] Pai, S., M. J. Thazhuthaveetil, and R. Govindarajan (2013) "Improving GPGPU Concurrency with Elastic Kernels," SIGPLAN Not.
[33] Huang, S., S. Xiao, and W. Feng (2009) "On the Energy Efficiency of Graphics Processing Units for Scientific Computing," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, IEEE Computer Society.
[34] Li, J. and J. F. Martínez (2005) "Power-performance Considerations of Parallel Computing on Chip Multiprocessors," ACM Trans. Archit. Code Optim.
[35] Huang, H., K. G. Shin, C. Lefurgy, and T. Keller (2005) "Improving Energy Efficiency by Making DRAM Less Randomly Accessed," in Proceedings of the 2005 International Symposium on Low Power Electronics and Design, ISLPED '05, ACM.
[36] Luz, V. D. L., M. Kandemir, and I. Kolcu (2002) "Automatic Data Migration for Reducing Energy Consumption in Multi-bank Memory Systems," in Proceedings of the 39th Annual Design Automation Conference, DAC '02, ACM.