The Pennsylvania State University
The Graduate School
College of Engineering
KERNEL-BASED ENERGY OPTIMIZATION IN GPUS
A Thesis in
Computer Science and Engineering
by
Amin Jadidi
© 2015 Amin Jadidi
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science
December 2015
The thesis of Amin Jadidi was reviewed and approved∗ by the following:
Chita Das
Professor of Computer Science and Engineering
Thesis Advisor
Mahmut T. Kandemir
Professor of Computer Science and Engineering
Raj Acharya
Head of the Department of Computer Science
∗Signatures are on file in the Graduate School.
Abstract
Emerging GPU architectures offer a cost-effective computing platform by providing thousands of energy-efficient compute cores and high bandwidth memory that facilitate the execution of highly parallel applications. In this paper, we show that different applications, and in fact different kernels from the same application, might exhibit significantly varying utilizations of compute and memory resources. However, we observe that the same kernel displays similar behavior in its different invocations; moreover, most of the kernels are invoked many times during the course of execution. By exploiting these properties of kernels, in order to improve the energy efficiency of the GPU system, we propose a dynamic resource configuration strategy that classifies kernels as compute-intensive or memory-intensive based on their resource utilizations and dynamically employs memory voltage/frequency scaling or core shut-down techniques for compute- and memory-intensive kernels, respectively. This strategy uses performance and memory bandwidth utilization information from the first few invocations of a kernel to determine the optimal hardware configuration for future invocations. Experimental evaluations show that our strategy saves about 20% of total chip energy and 70% of total memory leakage power for memory- and compute-intensive kernels, respectively, which are within 8% of the optimal savings that can be obtained from an oracle scheme.
Table of Contents
List of Figures
List of Tables

Chapter 1  Introduction
Chapter 2  Background
Chapter 3  Motivation
  3.1  Investigating Resource Underutilization in GPUs
  3.2  Quantifying Resource Underutilization in GPUs at the Kernel Granularity
  3.3  Kernel-level Properties of GPGPU Applications
  3.4  Classifying GPGPU Kernels Dynamically at Runtime
Chapter 4  Kernel Based GPU Resource Management
  4.1  CSS: Core-Side Energy Optimization Scheme
  4.2  MSS: Memory-Side Energy Optimization Scheme
  4.3  Handling Outlier Kernels
    4.3.1  Kernels that are Executed Only Once (Outlier1-Scheme)
    4.3.2  Kernels with Different Behavior over Different Launches (Outlier2-Scheme)
  4.4  Putting it Together
  4.5  Microarchitecture and Hardware Overhead
Chapter 5  Experimental Results
  5.1  Methodology
  5.2  Dynamism of the CSS and MSS Techniques
  5.3  Evaluation of CSS
  5.4  Evaluation of MSS
  5.5  Running Multiple Applications Concurrently
Chapter 6  Related Work
Chapter 7  Conclusions
Bibliography
List of Figures
2.1  Target GPGPU architecture.
3.1  Effect of increasing the number of cores on the performance of three applications. For each application, the results are normalized with respect to the highest IPC observed with that application.
3.2  Effect of memory frequency scaling on the performance of three applications. For each application, the results are normalized with respect to the highest IPC observed in that application.
3.3  Energy consumption of select benchmarks with respect to the baseline. Each bar is normalized with respect to the energy consumption of the corresponding application in our base configuration (i.e., without any energy optimization).
3.4  Effect of increasing the number of cores on the performance of different kernels from MST.
3.5  The two most important kernels in PVC that exhibit different bandwidth utilizations.
3.6  IPC of kernels over different invocations. Different invocations of the same kernel provide very similar performance.
4.1  Performing binary search on the (a) number of SMs and (b) memory frequency/voltage.
4.2  Transitioning between the compute-intensive (CI) and memory-intensive (MI) states in PVC.
4.3  Microarchitecture design of the proposed techniques.
5.1  Stabilization of binary search (ideal number of SMs) for three different kernels from the HIST application.
5.2  Stabilization of binary search (ideal memory V/F) for two main kernels from the PVC application.
5.3  Dynamic power consumption breakdown across memory-intensive kernels (baseline, without any energy optimization).
5.4  Energy saving gained by using the optimal number of SMs and power gating the rest of the SMs.
5.5  Analyzing the saturated region for MST and Stencil.
5.6  Average number of cores.
5.7  Normalized IPC values for different applications with respect to the baseline.
5.8  Normalized memory leakage power for different applications with respect to the baseline.

List of Tables

5.1  Baseline configuration. We normalize the results with respect to the number of cores and memory voltage/frequency configuration denoted in bold.
5.2  List of GPU benchmarks. Type-M: applications with MBU > 40%. Type-O: outlier applications. Type-C: applications with MBU < 10%. In the last column, M and C refer to memory- and compute-intensive kernels.
Chapter 1
Introduction
GPUs are being increasingly employed to accelerate different types of computing platforms rang-
ing from embedded devices to supercomputers. As a result, today's GPUs no longer run only
graphics applications, but also applications from the database and high-performance computing
domains, among others. To cope with the contrasting demands of these different types of
applications, GPU architects keep increasing on-chip resources such as cores, caches, software-managed
memories and memory controllers (MCs), and projections [1] include even more powerful/resource-
rich GPU systems in the near future. An important question at this juncture is whether current
applications effectively utilize available on-chip resources in GPUs and, if not, why and what can
be done about it. Several recent papers [2–5] have focused on this resource utilization problem
and proposed techniques that handle underutilized hardware components in GPUs.
Our own research shows that both cores and memory bandwidth are highly underutilized in
many GPU applications. Moreover, different kernels of the same application can have significant
variations regarding utilizations of cores and memory bandwidth, making a universal solution
that works across different applications highly unlikely. Motivated by this observation, this paper
proposes an energy-saving strategy that exploits resource underutilization in GPUs. The unique
aspect of this strategy is that it operates at a kernel granularity, and uses both core shut-down
and memory frequency/voltage scaling to achieve significant energy savings.
Our main contributions in this work can be summarized as follows:
• We present empirical evidence clearly showing the underutilization of cores as well as mem-
ory bandwidth in a set of GPU applications. Our experimental results indicate not only that
different applications exhibit different utilizations of these two critical resources, but also
that even different kernels of the same application exhibit significant resource utilization
variations. Based on these results, we classify kernels in our applications as memory-intensive
and compute-intensive.
• We show that, despite the variations across kernels, different invocations of the same kernel
exhibit very similar behavior under a fixed resource allocation. This observation, combined
with the fact that an average kernel in GPU applications is invoked many times during the
course of execution, motivates a dynamic, history-based optimization strategy.
• We propose a kernel-level energy optimization strategy that capitalizes on these variations
across different kernels. Specifically, our strategy employs core shut-down for memory-
intensive kernels and memory frequency/voltage scaling for compute-intensive kernels. The
proposed binary search-based strategy uses the first few executions (invocations) of a kernel
to determine the ideal hardware configuration (in terms of the number of cores and memory
V/F) to be used in the remaining invocations.
• We present an experimental evaluation of our strategy using a set of GPU applications. The
results collected via a detailed simulation infrastructure indicate that our strategy saves
about 20% of total chip energy for memory-intensive and 70% of total memory leakage
power for compute-intensive kernels in our base architecture. We further show that these
savings are within 8% of optimal savings that could be obtained from a hypothetical scheme,
and are consistent across different values of our major simulation parameters.
• We also show in this paper how our approach handles outlier kernels that (i) are executed
only once or (ii) exhibit irregular behavior across successive invocations.
We believe this is the first paper that dynamically tunes the number of active SMs, at a
kernel-level granularity, based on memory utilization. Along these lines, Hong et al. [3] propose
an analytical model which predicts the ideal number of cores by using off-line information. In
another work, Lee et al. [4] analyze the impact of DVFS on cores/interconnect/caches and core
scaling techniques under a fixed power budget; however, they do not propose a dynamic scheme
for tuning these parameters at run-time.
The rest of this paper is structured as follows. Section 2 provides background on our target
GPU architecture. In Section 3, we describe the resource underutilization problem in GPUs. In
Section 4, we present our techniques for adaptive GPU reconfiguration. Section 5 presents an
experimental evaluation of the proposed strategy. Section 6 reviews related work and Section 7
concludes the paper.
Chapter 2
Background
In this section, we provide a brief background on the GPGPU architecture targeted by our work.
GPU Architecture: Our target GPU consists of multiple streaming multiprocessors (SMs),
each containing 32 CUDA cores [6]. Each CUDA core can execute a thread, in a “single-
instruction, multiple-threads” (SIMT) fashion. This architecture is supported by a large register
file that hides the latency of memory accesses [7,8]. The memory requests generated by multiple
concurrently executing threads in an SM are coalesced into fewer cache lines and sent to L1 data
cache, which is shared by all CUDA cores in the SM. Each SM also contains a read-only texture
cache, a constant cache, and a low-latency software-managed shared memory.

Figure 2.1. Target GPGPU architecture.

Misses in these caches
are injected into the network, which connects the SMs to 6 memory partitions through a cross-
bar [9]. Each memory partition includes a slice of shared L2 cache, and memory controllers
(MCs) [10, 11]. Figure 2.1 shows this baseline architecture. We assume that per-SM power-
gating [3,4], and multiple voltage-frequency states for memory controllers [5,12,13] are available
to our baseline.
GPGPU Applications: Execution on the GPU starts with the CPU allocating memory in GPU
memory. Then, the CPU copies the required data into this allocated memory, and a kernel is
launched on the GPU. During the course of execution of an application, the CPU generally
launches multiple kernels on the GPU. Each kernel consists of a set of parallel lightweight
threads [14]. Once all the threads finish their computation, the resulting data are copied to
the CPU memory and the GPU memory is freed. The number of threads in a kernel is a good
indicator of whether the kernel is scalable up to the number of SMs present in the system. Kernels containing only a few threads
would keep most of the core resources idle; thus, application scalability would be the limiting
factor of performance. Conversely, the high memory demands of many concurrently-executing threads
might saturate the memory bandwidth, causing insufficient memory bandwidth to be the
limiting factor of performance [3].
Chapter 3
Motivation
In this section, we characterize our applications at the kernel level to understand their resource
(core and memory) usage profiles. This characterization demonstrates the need for a dynamic
resource management scheme.
3.1 Investigating Resource Underutilization in GPUs
The main philosophy of GPGPU architectures is to provide a large number of computing cores
supported by a high bandwidth memory, in order to have a high throughput system. Such a
resource-rich design will be cost-effective only if the main resources such as computing cores
and memory are effectively utilized by hosted applications. Thus, it is vital to understand the
impact of different applications on the utilization of GPU resources. Typically, memory-intensive
applications utilize the memory bandwidth well, but might not need all the cores to be active in
order to achieve their best performance. On the other hand, compute-intensive applications utilize the
compute cores well, but leave the memory bandwidth underutilized.
To illustrate the effect of the number of cores on the system, let us examine Figure 3.1 (our
experimental setup will be given in Section 5.1). This figure shows the application performance
Figure 3.1. Effect of increasing the number of cores on the performance of three applications. For each application, the results are normalized with respect to the highest IPC observed with that application.
(normalized IPC), as we vary the number of available cores in three representative applications
(PATH, MUM and BFS) exhibiting different behaviors. Among these applications, PATH is
the only compute-intensive application, and its performance increases linearly as we increase
the number of cores, meaning that the application utilizes all the available GPU computational
resources.
The other applications are memory-intensive, and we observe that their performance does not
improve beyond a certain point. In fact, we even see some performance loss in BFS. In order to
understand the underlying reason for the saturation in performance with the increasing number
of cores, we need to look at the memory bandwidth demands of GPGPU applications.
Each thread of a GPGPU application has a certain memory bandwidth demand, and assuming
that an application has enough threads, as we increase the number of cores, we also increase the
number of concurrently-running threads, which in turn leads to an increase in the number of
memory transactions (read/write requests) from cores to memory per unit time. Beyond a
certain number of cores, this increase in memory traffic might cause the memory bandwidth to
saturate. Beyond this number, using additional cores will not improve performance, instead it
might lead to longer latency for memory accesses and longer stall times for the cores [15,16].
Besides, an application using an excessive number of concurrently-running threads might
suffer from severe resource contention [2,17], which further exacerbates the problem. If we could
somehow know the ideal core count for a memory-intensive application (i.e., the number of cores
Figure 3.2. Effect of memory frequency scaling on the performance of three applications. For each application, the results are normalized with respect to the highest IPC observed in that application.
beyond which we do not gain significant performance benefits), we could turn off the remaining
cores to save power without hurting performance significantly, and potentially improve it in some
applications such as BFS (see Figure 3.1).
To illustrate the effect of memory bandwidth on application performance, we vary the mem-
ory frequency, and in turn the corresponding memory supply voltage. Figure 3.2 shows the
application performance in three representative applications (DMR, MM and CONS) exhibit-
ing different behaviors. The performance of CONS, which is a memory-intensive application,
increases linearly with increasing memory V/F. On the other hand, DMR does not have many
memory transactions, and the thread-level parallelism available in the application provides suffi-
cient latency tolerance, and thus, it is not sensitive to the memory speed. MM exhibits a behavior
similar to CONS when memory V/F is increased up to a certain point. Beyond that point, a
further increase in memory bandwidth does not affect MM’s performance, as the application
has enough latency tolerance, and its behavior resembles that of DMR. Therefore, by
figuring out the suitable V/F for the memory, one could potentially reduce its energy consump-
tion without hurting the performance. Because the abundantly available thread-level parallelism
(TLP) in GPUs provides latency tolerance, reducing the memory frequency should not affect the
application performance significantly as long as there are enough threads to keep the cores busy.
Thus, from a performance angle, voltage/frequency scaling at the memory level can be a suitable
knob for compute-intensive applications.
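The leverage of this knob comes from the standard CMOS power relations (a textbook approximation we add for context, not a formula from this thesis): dynamic power grows with the square of the supply voltage, so lowering memory voltage together with frequency reduces dynamic power superlinearly, while a latency-tolerant kernel pays little performance penalty:

```latex
P_{\mathrm{dyn}} \approx a \, C_{\mathrm{eff}} \, V_{dd}^{2} \, f
\quad\Rightarrow\quad
\frac{P'_{\mathrm{dyn}}}{P_{\mathrm{dyn}}} \approx
\left(\frac{V'_{dd}}{V_{dd}}\right)^{2} \frac{f'}{f}
```

where $a$ is the activity factor and $C_{\mathrm{eff}}$ the effective switched capacitance.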
Figure 3.3. Energy consumption of select benchmarks with respect to the baseline. Each bar is normalized with respect to the energy consumption of the corresponding application in our base configuration (i.e., without any energy optimization). (Bars: Optimal-Config-For-Application and Optimal-Kernel-Based; benchmarks: SSSP, BFS, SP, Stencil, HIST, MST, PVC, Kmean.)
Take-away points: Taking all these factors into account, we conclude that (1) memory-
intensive applications can underutilize the computing cores, and (2) compute-intensive applica-
tions can underutilize the available memory bandwidth. Therefore, finding the ideal number
of active cores for memory-intensive applications and the ideal memory frequency/voltage for
compute-intensive applications is imperative in order to achieve a system with lower energy
consumption.
3.2 Quantifying Resource Underutilization in GPUs at the Kernel Granularity
For a more detailed analysis of the effects of reconfiguring hardware resources on application
performance and energy consumption, we investigate our applications at a finer granularity, i.e.,
at the kernel level. Each GPGPU application consists of one or more kernels, each of which
is launched one or more times during the execution of that application. Different kernels that
belong to the same application might exhibit a large variance in their resource demands, leading
to different resource utilizations across different phases of the application.
Figure 3.3 shows the normalized energy consumption of 8 memory-intensive applications with
respect to the baseline. First, for each application, we varied the number of cores between 1 and
Figure 3.4. Effect of increasing the number of cores on the performance of different kernels from MST.
32, and executed them with a different core count, fixed throughout the whole execution. The
first bar in the figure shows the lowest energy consumption (while maintaining approximately
the same performance as the baseline). Next, we varied the number of cores for each individual
kernel, but observed that the optimal core count that provides the lowest energy consumption
for each kernel is different. The second bar shows the energy consumption when each kernel is
executed with its corresponding optimal core count (again, under the same performance as in
the base case).
One can see from these results that, for four applications (HIST, MST, PVC, Kmean), the
optimal configuration at the kernel granularity provides lower energy consumption than that
of the optimal configuration at the application granularity. This is because these applications
consist of multiple kernels, and each kernel has a different resource requirement. On the other
hand, there is no difference between application-level optimization and kernel-level optimization
for the applications that have only one kernel, executed many times. This figure shows that (1) we
can reduce energy consumption without hurting performance by changing the number of cores,
and (2) modulating the number of cores at the kernel-granularity is better than doing so at the
application-granularity. We are not aware of any application-based approach that dynamically
determines the best GPU configuration; the results reported in Figure 3.3 for the application-based
approach were collected by analyzing statistical data and represent the best hypothetical
application-based approach.
Figure 3.5. The two most important kernels in PVC that exhibit different bandwidth utilizations.
Next, we examine two applications with multiple kernels, MST and PVC, in more detail
and explain why the kernel-level optimum provides better energy savings than the application-
level optimum. Figure 3.4 shows the effect of the number of cores on the performance of four
different kernels from MST. Each of these kernels is executed many times during the course
of execution of this application. Considering the performance scalability of a kernel as the
number of cores is varied, Kernel-4 can be classified as compute-intensive while the other three
kernels are relatively memory-intensive. Moreover, each of these memory-intensive kernels has a
different saturation point where the performance does not increase with more compute resources.
Similarly, Figure 3.5 shows two different kernels of PVC. We observe that, from the perspective
of memory bandwidth utilization, Kernel-2 can be considered as compute-intensive and Kernel-1
as memory-intensive. Figure 3.4 and Figure 3.5 both point to the fact that it might not always be
the case that an application exhibits a similar behavior in terms of its resource demands during
its entire execution. Instead of classifying applications as memory- or compute-intensive based
on the whole execution, doing so based on kernel-granularity information might give us
a more accurate picture of their behavior. In fact, we observe that some GPGPU applications
have both compute-intensive and memory-intensive kernels. Therefore, in order to reduce the
energy consumption of an application, core shut-down may be the better choice for one kernel,
while memory DVFS may be the better choice for another.
Take-away point: Due to significantly different resource demands of different kernels of an
Figure 3.6. IPC of kernels over different invocations. Different invocations of the same kernel provide very similar performance.
application, hardware reconfiguration strategies should operate at a kernel-granularity.
3.3 Kernel-level Properties of GPGPU Applications
By analyzing the launched kernels during the execution of different applications, we observe two
important properties: First, most of the kernels are launched multiple times. Second, those ker-
nels that are launched multiple times exhibit a very consistent behavior (IPC, DRAM bandwidth
utilization, etc.) in different invocations. Figure 3.6 shows the stability of select kernels across
different invocations. We collect IPC of kernels that are executed multiple times and normalize it
to the median. It can be seen that, for each kernel, its IPC in different invocations is distributed
within a 2% distance from the median IPC. Based on this figure, we can safely assume that a
kernel would exhibit similar behavior in its different invocations. We will talk about this margin
more in Section 4.1. Note also that, we observed similar stability for the memory bandwidth
utilization of a kernel in different executions. It should also be mentioned that, across all the
applications we analyzed in this work, we encountered two applications which potentially benefit
from using fewer active SMs, but the target kernel does not show a stable IPC over different
launches. In Section 4.3.2, we further analyze such outlier scenarios, and explain how we handle
them.
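The stability property measured in Figure 3.6 can be expressed as a small check (an illustrative sketch; the function name and the tolerance parameterization are ours, not part of the proposed hardware):

```python
# A sketch of the stability check behind Figure 3.6 (function name and the
# 2% tolerance parameterization are ours, for illustration).
import statistics

def is_stable(ipcs, tolerance=0.02):
    """Return True if every invocation's IPC lies within `tolerance`
    (2% by default) of the kernel's median IPC across invocations."""
    med = statistics.median(ipcs)
    return all(abs(ipc - med) / med <= tolerance for ipc in ipcs)

# A kernel with consistent behavior over different launches:
print(is_stable([100.0, 101.2, 99.4, 100.5]))  # -> True
# An outlier kernel whose IPC varies across launches (Section 4.3.2):
print(is_stable([100.0, 120.0, 85.0]))         # -> False
```

A kernel passing this check can safely reuse a configuration tuned during its first few invocations; one failing it falls under the outlier handling of Section 4.3.2.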
Take-away point: Motivated by these two important observations, the kernel-granularity
information that we can use in our strategy can be collected from that kernel’s first few executions.
We can use this information to figure out what the optimal configuration (number of active SMs
and memory V/F) for that kernel is. After finding this optimal configuration, we can execute
the remaining invocations using the optimal configuration.
3.4 Classifying GPGPU Kernels Dynamically at Runtime
Each GPU thread has a certain memory bandwidth demand. As we increase the number of
SMs, the number of concurrently-running threads increases, and as a result, we have a higher
bandwidth demand. To investigate the effect of increasing compute resources on the memory
system, we monitor the memory bandwidth utilization (MBU), which is equal to the number of
DRAM cycles in which a read or write request is served, divided by the total number of DRAM cycles. Intuitively, we
expect to see performance improvement as we increase the number of SMs, as long as we do not
saturate the available memory bandwidth (theoretically, MBU = 100%). However, we observed
in our experiments that the performance of the memory-intensive kernels gets saturated when
MBU is much less than 100%. For instance, in application SP, we observe that using all the SMs
leads to about 55% memory bandwidth utilization. If we keep reducing the number of active SMs
down to 11, we still observe the same IPC and MBU. Note that, the statistical model reported
in [3] may not capture this behavior because it assumes that, as long as MBU is less than 100%,
increasing the number of SMs will improve the performance. Similarly, we studied the impact
of memory V/F scaling on a wide variety of GPGPU applications, and noticed that only kernels
with very low MBU are not affected negatively by scaling.
To find the optimal hardware configuration for a kernel, we need an approach to identify whether a
kernel is memory- or compute-intensive dynamically at runtime. Based on our preliminary
experiments, we find that 40% memory bandwidth utilization is a reasonable threshold to classify
a kernel as memory-intensive. Therefore, if the MBU of a kernel is above this number, we consider
that kernel as memory-intensive, meaning that this kernel is potentially using too many SMs.
Similarly, based on our preliminary experiments on memory V/F scaling, we consider a kernel
with memory bandwidth utilization less than 10% to be compute-intensive, meaning that we can
potentially employ DVFS on memory without affecting performance significantly.
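This classification rule can be sketched as follows (a minimal illustration; the constant and function names are ours, while the 40%/10% thresholds are the ones chosen in the text):

```python
# A minimal sketch of the runtime classification rule (constant and function
# names are ours; the 40%/10% thresholds are the ones chosen in the text).
MEMORY_INTENSIVE_MBU = 0.40   # above this, the kernel saturates memory
COMPUTE_INTENSIVE_MBU = 0.10  # below this, memory DVFS is safe to apply

def classify_kernel(busy_dram_cycles, total_dram_cycles):
    """Classify a kernel from one profiled invocation."""
    mbu = busy_dram_cycles / total_dram_cycles
    if mbu > MEMORY_INTENSIVE_MBU:
        return "memory-intensive"   # candidate for core shut-down (CSS)
    if mbu < COMPUTE_INTENSIVE_MBU:
        return "compute-intensive"  # candidate for memory V/F scaling (MSS)
    return "balanced"               # keep all SMs and the highest memory V/F

# Example: SP reaches ~55% MBU with all SMs active -> memory-intensive.
print(classify_kernel(55, 100))  # -> memory-intensive
```

Kernels in the middle band (10% < MBU < 40%) utilize both resources in a balanced way and keep the default configuration.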
Sensitivity: Note that, the main goal of our classification is just to identify kernels that are
amenable to core turn-off and memory scaling. We found our two thresholds (40% and 10%) to
be very accurate at capturing memory- and compute-intensive kernels. In fact, in our experiments, we
never encountered a kernel with an MBU of less than 40% that could gain from using fewer
active SMs. Besides, we also tested the 50% and 60% thresholds for detecting memory-sensitive
kernels. We observed that, if we use a 50% MBU threshold, we lose some of the opportunities
that could have been exploited to save energy in the HIST and PVC applications. On the other
hand, if we use a 60% MBU threshold, we would lose most of the opportunities across all the
applications. Therefore, by using 40% memory bandwidth utilization as our threshold, we
could detect all the kernels that should potentially use fewer active SMs. Similarly,
our sensitivity experiments showed that 10% bandwidth utilization is a good threshold to tag a
kernel as compute-intensive.
Take-away point: We can classify a kernel as compute- or memory-intensive at the end of
its first execution. This information can be used to modulate the number of cores or memory
V/F to reduce the energy consumption of the system.
Chapter 4
Kernel Based GPU Resource Management
In this section, we describe our strategy to find the ideal SM count and the memory V/F con-
figuration for each kernel. The first time a kernel is launched by the CPU, we allocate all the
available SMs to that kernel and set the memory to its highest V/F. When the execution of this
kernel is over, we collect the IPC and MBU of that kernel. If the kernel is recognized as
memory-intensive, we employ our Core-Side Energy Optimization Scheme (CSS) to find the ideal number
of active SMs for that kernel, and if it is classified as compute-intensive, we use our Memory-Side
Energy Optimization Scheme (MSS) to figure out the ideal memory V/F. Otherwise (i.e., 10% <
MBU < 40%), we use all the cores and highest memory V/F for the future invocations, because
the kernel utilizes both compute and memory resources in a balanced way.
4.1 CSS: Core-Side Energy Optimization Scheme
The goal of CSS is to employ a core shut-down mechanism to reduce the energy consumption
of memory-intensive kernels. The key idea behind CSS is to monitor IPC and MBU statistics
at the end of the first few invocations of a memory-intensive kernel to find the ideal number of
active SMs. To achieve this, we propose a binary search over different numbers of active SMs across
multiple launches. Initially, we already have the IPC and MBU of the kernel when it has
been allocated all the SMs (its first invocation).
The second time the same kernel is launched, it will be allocated only half of the available
SMs and then, based on the IPC and MBU of this execution, we will keep performing the
binary search over the next few launches of that kernel to finally reach to the ideal number of
active SMs. Figure 4.1 shows a sample binary search over five executions. The binary search
takes log(Number-of-SMs) steps to find the ideal answer. Our base architecture has 32 SMs;
consequently, after 5 steps we will reach to the ideal point.
In this search process, we compare the IPC of the new configuration with the IPC of the
very first execution, which had all the SMs activated. When we compare two IPCs from two
different launches, we accept a small window of error: in practice, as long as
|IPC2 − IPC1| ≤ α, we consider them the same. Such a comparison works well because, as soon
as we cross back over the saturation point and step into the linear part of the execution
(Figure 3.1), the newly observed IPC will be considerably less than IPC1.
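The search over SM counts can be sketched as follows (a minimal sketch: `measure_ipc` is a hypothetical stand-in for one real kernel launch with a given SM count, and `alpha` is the IPC-comparison tolerance described above):

```python
def search_ideal_sm_count(measure_ipc, total_sms=32, alpha=2.0):
    """Binary-search the smallest SM count whose IPC matches the
    all-SMs IPC to within alpha. In the proposed design each probe is a
    real kernel launch; here `measure_ipc(n)` is an illustrative hook.
    """
    ipc_full = measure_ipc(total_sms)      # first invocation: all SMs
    lo, hi = 1, total_sms
    ideal = total_sms
    while lo <= hi:                        # about log2(total_sms) probes
        mid = (lo + hi) // 2
        if abs(measure_ipc(mid) - ipc_full) <= alpha:
            ideal = mid                    # still in the saturated region
            hi = mid - 1                   # try even fewer SMs
        else:
            lo = mid + 1                   # crossed into the linear region
    return ideal
```

For a kernel whose performance saturates at 11 SMs (e.g., IPC = min(10n, 110)), the search settles on 11 active SMs.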
To the best of our knowledge, current GPUs do not support a per-core power-gating
mechanism. However, judging from the multicore domain [18], we expect that future GPGPUs
with more CUDA cores will support fine-grain power gating [3, 4]. Considering this, our
goal is to find the ideal number of active SMs and use the power-gating mechanism to power off
the rest of the SMs. This way, we can reduce the overall power consumption of the GPU and
improve its energy efficiency. Note that, in GPUs, static/leakage power contributes a
considerable portion of the total power consumption. In fact, Leng et al. [5] report that,
for an NVIDIA GTX 480 card, static power consumption is approximately 59 W, most of which
belongs to the computing cores. As technology nodes continue to shrink and the number of
CUDA cores on a GPU card keeps increasing, leakage power will become an even more important
Figure 4.1. Performing binary search on the (a) number of SMs and (b) memory frequency/voltage.
factor in total power consumption [1, 19]. Hence, by finding the ideal number of active SMs, not
only can we achieve lower dynamic power consumption, but we can also reduce leakage power
via per-SM power gating.
4.2 MSS: Memory-Side Energy Optimization Scheme
The algorithm for finding the ideal memory V/F is very similar to the one described for finding
the ideal number of active SMs. If we recognize a kernel as compute-intensive after its very
first execution, we employ a binary search to find the proper memory V/F over the next launches
of that kernel. As we have 7 different voltage/frequency pairs, it takes 3 steps to reach the
ideal point (Figure 4.1). Note that, for memory-intensive kernels, a considerable portion of the
total dynamic power is due to memory transactions, whereas in compute-intensive kernels the
memory system contributes only a small portion of it. Therefore, for compute-intensive
kernels, scaling down the memory V/F does not affect the overall dynamic power consumption
much; the goal behind such scaling is instead to reduce the memory leakage power
during those underutilized phases.
We assume 7 P-states for our target memory, ranging from about 500 MHz to 2 GHz [5, 6].
We use 45 nm predictive technology models [20] to scale the voltage with frequency. Based
on these parameters, we scale the voltage from 1 V to 0.55 V. Based on our definition of
compute-intensiveness, we scale the memory V/F only if the number of memory transactions is
low, since our focus is to reduce energy consumption while incurring negligible performance
loss. Our simulations show that, for kernels with bandwidth utilization over 10%, scaling the
memory V/F can considerably affect performance, which is not acceptable in our work. Note
that DVFS can also be applied to memory at a finer granularity, which is out of the scope
of this paper [12, 13].
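Under the assumption of equal voltage steps between 0.55 V and 1 V (Section 5.1) and correspondingly spaced frequencies, the P-state table can be sketched as follows (the exact voltage/frequency pairing in the thesis comes from the 45 nm predictive technology models, so the linear pairing here is an illustrative simplification):

```python
def memory_p_states(n=7, v_min=0.55, v_max=1.0, f_min=500e6, f_max=2e9):
    """Build the memory P-state table: n voltage/frequency pairs with
    equal voltage intervals between v_min and v_max, and evenly spaced
    frequencies between f_min and f_max (an assumed linear pairing).
    """
    states = []
    for i in range(n):
        v = v_min + i * (v_max - v_min) / (n - 1)
        f = f_min + i * (f_max - f_min) / (n - 1)
        states.append((round(v, 3), f))   # (volts, hertz)
    return states
```

With 7 P-states, a binary search over the table needs only 3 probes, matching the 3 steps quoted above.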
4.3 Handling Outlier Kernels
We face two types of outlier kernels among different applications. The first category consists
of kernels that are executed only once, which can mostly be detected at compile time. The
second category includes kernels that are executed multiple times but do not exhibit consistent
behavior over those invocations. Such inconsistency is detected by monitoring the MBU of each
execution. In this section, we explain how our proposed CSS and MSS schemes handle these two
types of outliers.
4.3.1 Kernels that are Executed Only Once (Outlier1-Scheme):
For such kernels, we employ sampling within the kernel execution period and treat each sampling
window as if it were a kernel execution. This way, we can collect the required statistics and
perform our binary search over the initial sampling periods. We can potentially use two types
of sampling: static-interval and dynamic-interval. The static approach assumes a fixed number
of cycles as the sampling period for all kernels. In the dynamic approach, on the other hand, the
sampling period is set dynamically, based on the behavior of each kernel. We noticed that, in
the static approach, the sampling period cannot be short if it is to accurately capture the
behavior of the kernel.
To illustrate this point, we selected two applications (MUM and LIB) from this group and, by
analyzing their statistics, observed that, with a long enough sampling period, we can apply
our techniques (core shut-down and memory V/F scaling) to such kernels as well. In analyzing
these two applications, we used a 100K-cycle sampling window, which gives us consistent
behavior over each window. The dynamic-interval approach, on the other hand, is applicable to
kernels that have many blocks/CTAs to execute. For such kernels, we launch as many CTAs as
possible on the SMs (Num-SMs * Max-CTA-Per-SM) and, when they finish, launch the next set of
CTAs. We can then use every set of launched CTAs as a sampling period. The intuition behind
this approach is the similar behavior exhibited by the CTAs, which gives us consistent behavior
over those intervals. Of LIB and MUM, only MUM has enough CTAs to use the dynamic-interval
strategy.
We could also use this approach when a kernel is executed multiple times, in order to find the
ideal point faster; however, using the whole kernel execution for monitoring its behavior lets
us make a more accurate decision.
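The two sampling modes can be sketched as follows (an illustrative sketch; the 100K-cycle static window matches the one used for MUM and LIB, while the dynamic mode takes precomputed per-wave cycle counts as a simplifying assumption):

```python
def sampling_windows(total_cycles, mode="static", window=100_000,
                     cta_wave_cycles=None):
    """Split a single kernel launch into sampling windows that stand in
    for separate invocations (Outlier1-Scheme). 'static' uses a fixed
    cycle window; 'dynamic' uses one window per wave of concurrently
    launched CTAs, whose durations are given as a list here.
    Returns the end cycle of each window.
    """
    if mode == "static":
        return list(range(window, total_cycles + 1, window))
    # dynamic: one window per CTA wave
    bounds, t = [], 0
    for cycles in cta_wave_cycles:
        t += cycles
        bounds.append(t)
    return bounds
```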
Overhead: An important question is how we can shut down some of the SMs in the middle of a
kernel's execution without incurring the performance/power overhead of migration/context-
switching among SMs. To address this, when we perform the binary search over successive
sampling periods, we pause SMs instead of shutting them down. After finding the ideal number
of active SMs, we do not assign any more CTAs to the SMs identified to be turned off; instead,
we keep them paused. When the other active SMs finish their CTAs, we can turn them off and
restart the paused ones. Such a pausing approach does not cause any migration/context-switch
overhead among SMs.
Figure 4.2. Transitioning between the compute-intensive (CI) and memory-intensive (MI) states in PVC.
4.3.2 Kernels with Different Behavior over Different Launches (Outlier2-Scheme):
Among the benchmarks in our experimental suite, we found only two applications (PVC and
PVR) that can potentially use fewer active SMs for some kernel launches but do not have
similar IPCs over different launches. However, for this type of application, we observe a
quite stable MBU over consecutive launches. In more detail, we noticed that such programs
have multiple nested loops, which results in many successive calls to one kernel, then a
sequence of successive calls to another kernel, and so on (as can be seen in Figure 4.2). We
observe similar behavior within each sequence of successive calls to the same kernel, but two
different sequences of calls (to the same kernel) behave differently in terms of IPC. These
applications appear to have a consistent MBU over successive launches within each loop
iteration, which motivates us to use this metric as our knob instead of IPC.
We observe that some of these kernels exhibit memory-intensive behavior for tens of successive
launches and then switch to a compute-intensive state. Figure 4.2 plots such transitions for
kernel-2. Therefore, when we optimize the number of SMs for memory-intensive periods, we need
to keep monitoring the MBU. If we observe a sudden decrease in MBU, we are facing a transition
to the compute-intensive state and need to activate all the SMs and potentially
reconfigure the memory system to a lower V/F. A similar scenario can occur when transitioning
back from the compute-intensive to the memory-intensive state. In Figure 4.2, kernel-1 is
always in the memory-intensive mode. Kernel-2, on the other hand, switches between the memory-
and compute-intensive states, which is captured by monitoring its MBU.
We observed that, if a kernel is in the saturated mode, reducing the number of SMs does not
affect the memory bandwidth utilization as long as we remain in the saturated region. Knowing
this property, we can use the MBU instead of the IPC as our knob for this type of application,
meaning that we can keep reducing the number of cores as long as we observe the same bandwidth
utilization. We also noticed that, for this class of applications, the threshold we defined
for recognizing memory-intensive kernels is very accurate. Therefore, as long as the MBU does
not drop below that threshold as we decrease the number of SMs, we do not lose noticeable
performance.
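The MBU-driven reduction for Outlier2 kernels can be sketched as follows (a sketch that halves the SM count per probe; `measure_mbu` is a hypothetical per-launch measurement hook, and the 2% equality window is an assumed tolerance):

```python
def reduce_sms_by_mbu(measure_mbu, total_sms=32, mi_threshold=0.40):
    """For Outlier2 kernels, IPC is unstable across launches, so MBU is
    the knob: keep halving the SM count while the observed MBU stays at
    the saturated level and above the memory-intensive threshold.
    """
    mbu_full = measure_mbu(total_sms)
    n = total_sms
    while n > 1:
        candidate = n // 2
        mbu = measure_mbu(candidate)
        # same bandwidth utilization -> still saturated; keep reducing
        if mbu >= mi_threshold and abs(mbu - mbu_full) < 0.02:
            n = candidate
        else:
            break
    return n
```

For a kernel whose MBU holds at 55% down to 8 SMs and then drops, the sketch stops at 8 active SMs.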
4.4 Putting it Together
Algorithm 1 presents our high-level strategy. The IdealConfig function returns the ideal
number of SMs and the ideal memory V/F for a kernel. The Iteration function returns the
configuration of the GPU at each step of CSS/MSS. After the execution of the kernel, the
UpdateStatus function collects the required statistics. All this logic is implemented in the
Logic-Unit shown in Figure 4.3. As can be seen in Algorithm 1, we can either ignore outliers
and let them run under the base configuration or use the proposed schemes to improve them as
much as possible.
4.5 Microarchitecture and Hardware Overhead
Figure 4.3 illustrates a high-level view of the proposed architecture. After the execution of a
kernel, the logic unit collects the required statistics and determines the next step in the
binary search process. After reaching the ideal configuration, the logic unit sets the number
of active
Algorithm 1 Pseudo-code representing the high-level strategy

if (KernelID.IdealConfig() = NotFound) then
    if (KernelID.SingleCall()) then
        // Apply Outlier1-Scheme
    else
        GPUConfig -> KernelID.Iteration();
        KernelID.UpdateStatus();
        if (KernelID.Monitor() = Outlier2) then
            // Apply Outlier2-Scheme
        end if
    end if
else
    GPUConfig -> KernelID.IdealConfig();
end if
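A software rendering of this decision logic might look as follows (the dictionary keys mirror the pseudocode functions; this is an illustrative stand-in for the Logic-Unit, not a hardware interface):

```python
def next_config(record, base_config=(32, 6)):
    """One decision step of the high-level strategy for a launched kernel.

    `record` holds per-kernel Logic-Unit state: 'ideal' (stored
    (active_SMs, memory_P-state) pair, or None while still searching),
    'single_call' (Outlier1: kernel launched only once), and 'iteration'
    (the next CSS/MSS binary-search configuration to try).
    """
    if record['ideal'] is not None:
        return record['ideal']      # ideal already found: reuse it
    if record['single_call']:
        return base_config          # Outlier1: sample within the launch
    return record['iteration']      # continue the CSS/MSS binary search
```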
SMs and memory frequency/voltage of the GPU to its ideal configuration for future invocations
of that kernel.
In our approach, we store the IPC and MBU of each individual kernel. These numbers are used
during the binary search, and, after reaching the ideal point, we also need to store the ideal
configuration for future invocations. Therefore, for each kernel we need to store four
variables: the IPC of the first execution (floating point), the memory bandwidth utilization
of the first execution (floating point), the optimal number of active SMs (5 bits), and the
optimal memory frequency/voltage (3 bits). Hence, each kernel needs 9 bytes of storage to keep
its information. In our experiments,
Figure 4.3. Microarchitecture design of the proposed techniques.
we did not encounter an application with more than 20 kernels. Therefore, even if we consider
such a scenario, we need a table with 20 entries, which amounts to 180 bytes. Note that we
need not worry about applications with too many kernels: not all of those kernels will be
launched multiple times, so if we apply a simple LRU policy to manage the contents of this
table, the kernels that are executed only once will be evicted from the table automatically.
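Such a table can be sketched with an LRU-ordered map (a software sketch of the structure described above; the capacity and record layout follow the text, while the class and method names are illustrative):

```python
from collections import OrderedDict

class KernelTable:
    """20-entry per-kernel configuration table with LRU replacement, so
    single-launch kernels age out of the table naturally."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self.entries = OrderedDict()  # kernel_id -> (ipc, mbu, sms, pstate)

    def touch(self, kid, record):
        if kid in self.entries:
            self.entries.move_to_end(kid)     # mark as most recently used
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the LRU kernel
        self.entries[kid] = record
```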
Figure 4.3 shows the microarchitecture design of the proposed dynamic kernel-based
reconfiguration technique. We assume that each SM has 2 counters (overall, 32*2*4 B) to track
the number of executed instructions and the number of core cycles, in order to calculate the
IPC. Likewise, each memory channel needs 2 counters (overall, 6*2*4 B) to track the number of
memory transactions and the number of memory cycles, in order to calculate the MBU. Therefore,
our technique has an overall hardware overhead of 484 bytes. We use per-SM and per-channel
counters to make the design adaptable to the multiple-kernel case explained in Section 5.5.
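The overhead figures above can be reproduced with a few lines of arithmetic:

```python
# Hardware bookkeeping overhead, using the figures from the text.
NUM_SMS, NUM_CHANNELS, COUNTER_BYTES = 32, 6, 4
TABLE_ENTRIES, ENTRY_BYTES = 20, 9  # 2 floats + 5 bits + 3 bits per kernel

sm_counters = NUM_SMS * 2 * COUNTER_BYTES        # instruction + cycle counters
mem_counters = NUM_CHANNELS * 2 * COUNTER_BYTES  # transaction + cycle counters
kernel_table = TABLE_ENTRIES * ENTRY_BYTES       # per-kernel IPC/MBU/config
total = sm_counters + mem_counters + kernel_table  # 256 + 48 + 180 = 484 B
```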
Chapter 5 | Experimental Results
5.1 Methodology
Platform: To evaluate our proposal, we used GPGPU-Sim v3.2.1 [24]. The details of the
simulated configuration are listed in Table 5.1; it is similar to the GTX 480 configuration.
In our experiments, we varied the number of active SMs between 1 and 32, and used 32 in our
baseline. We also used 7 different memory voltage values, between 0.55 V and 1 V, with equal
intervals. Corresponding to these voltage values, we used 7 different memory frequency values,
between 500 MHz and 2 GHz. Our baseline uses 1 V and 2 GHz.
Benchmarks: Table 5.2 lists the applications we used in our evaluations. We consider a wide
range of applications from various benchmark suites: CUDA SDK [25], Rodinia [26], Parboil [27],
Mars [28], Shoc [29], and LonestarGPU [30]. We classify these applications/kernels as compute-
intensive (Type-C), memory-intensive (Type-M), or outlier (Type-O).
Performance Metrics: In this work, we focus on energy efficiency, so we report three metrics.
First, we report application performance in terms of IPC, normalized with respect to the
baseline configuration described in Table 5.1. Second, we report the power consumption of the
system using GPUWattch [5]. In particular, we focus on dynamic power, leakage power, and DRAM
Table 5.1. Baseline configuration. We normalize the results with respect to the number of cores and memory voltage/frequency configuration denoted in bold.

SM Config.: 1-32 Shader Cores, 1400 MHz, SIMT Width = 32
GPU Resources / Core: Max. 1536 Threads (48 warps, 32 threads/warp), 48 KB Shared Memory, 32684 Registers
Caches / Core: 16 KB 4-way L1 Data Cache, 12 KB 24-way Texture, 8 KB 2-way Constant Cache, 2 KB 4-way I-cache, 128 B Line Size
L2 Cache: 128 KB/Memory Partition, 128 B Line Size, 8-way, 700 MHz
Default Warp Scheduler: Greedy-then-oldest [21]
Features: Memory Coalescing, Inter-warp Merging, Immediate Post Dominator [22]
Interconnect: Crossbar, 1400 MHz, 32 B Channel Width
Memory Model: 6 GDDR5 MCs, 500 MHz-2 GHz, 0.55-1 V, FR-FCFS [23], 8 DRAM banks/MC
GDDR5 Timing: tCL = 12, tRP = 12, tRC = 40, tRAS = 28, tCCD = 2, tRCD = 12, tRRD = 6, tCDLR = 5, tWR = 12
Table 5.2. List of GPU benchmarks. Type-M: applications with MBU > 40%. Type-O: outlier applications. Type-C: applications with MBU < 10%. In the last column, M and C refer to memory- and compute-intensive kernels.

#   Suite     Application                  Abbr.    Type    Kernel
1   Lonestar  Single-Src Shortest Paths    SSSP     Type-M  M
2   Lonestar  Breadth-First Search         BFS      Type-M  M
3   Lonestar  Survey Propagation           SP       Type-M  M
4   Lonestar  Minimum Spanning Tree        MST      Type-M  M-C
5   Parboil   Saturating Histogram         HIST     Type-M  M-C
6   Shoc      2D Stencil Computation       stencil  Type-M  M
7   SDK       MUMerGPU                     MUM      Type-O  M
8   SDK       LIBOR Monte Carlo            LIB      Type-O  M
9   Mars      Kmeans Clustering            Kmean    Type-O  M-C
10  Mars      Page View Count              PVC      Type-O  M-C
11  Lonestar  Delaunay Mesh Refinement     DMR      Type-C  C
12  Rodinia   Cardiac Myocyte              MYO      Type-C  C
13  Shoc      Sparse Matrix Vector Multi.  SPMV     Type-C  M-C
14  Shoc      Lennard-Jones Potential      MD       Type-C  C
15  Parboil   Matrix Multiplication        MM       Type-C  C
16  Mars      Page View Rank               PVR      Type-C  M-C
power. Third, based on the performance and power results, we calculate the energy consumption
of the system. The results presented below include all the runtime overheads introduced by our
approach.
Figure 5.1. Stabilization of the binary search (ideal number of SMs) for three different kernels from the HIST application (MBU-1 ≈ 55%, MBU-2 ≈ 50%, MBU-3 ≈ 20%).
Figure 5.2. Stabilization of the binary search (ideal memory V/F) for the two main kernels from the PVC application (MBU-1 ≈ 55%, MBU-2 ≈ 5%).
5.2 Dynamism of the CSS and MSS Techniques
Figure 5.1 shows how our CSS technique stabilizes after five kernel invocations. As can be
seen, these three kernels from the HIST application exhibit different SM-count demands. Among
them, the MBU of kernel-3 is below the defined threshold for memory intensiveness, and it
effectively exploits all 32 SMs. The MBUs of kernel-1 and kernel-2, on the other hand, are
above that threshold, and CSS allocates them far fewer SMs without incurring any performance
loss. Figure 5.2 shows how our MSS technique stabilizes after three steps. As can be observed,
these two kernels from PVC have different memory utilizations. The MBU of kernel-1 is higher
than the threshold for compute-intensive kernels, so our technique does not try to change the
memory V/F for this kernel. The MBU of kernel-2, on the other hand, is less than
that threshold and our technique finds its ideal memory V/F after three steps.
5.3 Evaluation of CSS
Figure 5.3 shows the dynamic power consumption breakdown for the memory-intensive kernels of
different GPGPU applications in the base configuration (i.e., without any energy optimization).
For each of these kernels, memory power consumption contributes a considerable portion of the
total dynamic power consumption.
Figure 5.3. Dynamic power consumption breakdown across memory-intensive kernels (baseline, without any energy optimization). Components: SM caches, shared memory, register file, cores, L2/MC/DRAM, and NoC.
Static and Dynamic Power Consumption: Figure 5.4 reports the portions of the total energy
corresponding to static and dynamic power consumption. As can be seen, for some applications
only the static power consumption is improved. As we decrease the number of active SMs, for
kernels running in the saturated region, we potentially decrease the dynamic power consumption
in two ways. First, the kernel experiences less resource contention during execution [2].
Second, if the kernel had already incurred some performance loss, it experiences a shorter
execution time when using only the ideal number of SMs. As can be seen in Figure 5.7, the SSSP
and BFS applications, which had considerable performance degradation in the base
configuration, gain the most in terms of dynamic power saving.
On the other hand, after finding the ideal number of SMs, we power-gate the idle ones.
Therefore, during the course of execution, we have fewer active SMs, which helps in
lowering the static power consumption. The energy saving reported for static power consumption
in Figure 5.4 is therefore linearly dependent on the number of power-gated SMs.
It should also be noted that, among these applications, our technique appears to be less
effective on MST and Stencil. Figure 5.5 shows the performance of these two applications as we
vary the number of SMs from 10 to 48 (note that we have only 32 SMs in our base configuration).
As can be observed, MST and Stencil enter the saturated region beyond 30 SMs. The reason our
technique is less effective for these two applications is that, with around 32 SMs, they have
only just crossed the saturation point; therefore, the scope for reducing energy consumption
is limited.
Figure 5.4. Energy saving gained by using the optimal number of SMs and power-gating the remaining SMs (normalized dynamic and static energy for the Base, CSS, and Optimal configurations of each application, plus the harmonic mean).
Figure 5.5. Analyzing the saturated region for MST and Stencil (normalized IPC as the number of cores varies).
Performance: Figure 5.6 gives the average number of SMs at which CSS stabilized as the ideal
point, compared with the optimal number of SMs calculated offline. Figure 5.7 shows
Figure 5.6. Average number of cores for each application under the base, our, and the optimal configurations.
the impact of our technique on performance. As can be observed, the first four applications
experience severe performance loss if we allocate them all the available SMs. Our technique
improves the performance of these applications by 12% on average. In addition to the memory
bandwidth saturation problem, these applications also suffer from high resource contention
among concurrently running threads. We observed that, in the case of SSSP and BFS, the L2
cache miss rate increases from 32% and 45% (for the optimal number of SMs) to 78% and 77% (for
32 SMs), respectively. Our CSS technique degrades the performance of the remaining
applications by less than 2%.
Figure 5.7. Normalized IPC values for different applications with respect to the baseline.
Outlier Kernels: MUM and LIB belong to the first class of irregular applications
(Section 4.3.1), where a kernel is executed only once. PVC and Kmean, on the other hand,
belong to the second class of outlier applications (Section 4.3.2).
Note that, in this paper, our focus is on kernels that are executed multiple times; alongside
that, we explained how our techniques can be applied to the outliers. In this regard, kernels
that will be executed only once can be recognized at compile time, while abrupt transitions in
MBU tell us that we are facing the second class of outliers. After recognizing the outliers,
we can either ignore them or apply the proposed customized techniques to them.
Summary of Results: Based on the experimental results, CSS achieves up to 35% and on average
(harmonic mean) about 20% energy saving, which is within 8% of the optimal saving (Figure 5.4).
This technique improves the performance of 4 applications by 12% on average, and leads to a
performance loss of 2% on average for the other applications.
5.4 Evaluation of MSS
Since our goal is to optimize energy consumption without incurring performance loss, we do not
attempt to employ any DVFS technique on memory unless the kernel has few memory transactions.
Figure 5.8 shows the effect of employing DVFS on memory for different GPGPU applications. Note
that, over all these applications, the MSS technique is as accurate as the optimal
configuration. In compute-intensive kernels, memory power consumption contributes only a small
portion of the total dynamic power consumption because there are not many memory transactions.
For such kernels, the main portion of memory power consumption is leakage power, which can be
reduced by lowering the voltage. The impact of such a DVFS technique on total power
consumption is modest, but from the memory point of view it is a considerable gain. As can be
seen in Figure 5.8, the first three applications have the highest reduction in leakage power.
These applications have very few memory transactions, and our technique scales the memory V/F
to the lowest available point.
Transitions between compute- and memory-intensive states: The last two bars (PVR and PVC) in
Figure 5.8 exhibit transient behavior over time, and MSS can apply DVFS to memory only during
compute-intensive phases. Therefore, the MSS technique has less oppor-
Figure 5.8. Normalized memory leakage power for different applications with respect to the baseline.
tunity to reduce memory leakage power for these two applications. As can be seen in
Figure 4.2, the MBU of kernel-1 from PVC is always high, so we cannot apply the memory DVFS
technique to it. The MBU of kernel-2, on the other hand, is below 10% during the CI phase and
above 50% during the MI phase. Our algorithm dynamically detects such transitions between
compute- and memory-intensive phases. For instance, during the CI phase in Figure 4.2,
kernel-2 is assigned all 32 SMs and its memory voltage is reduced to half, while in the MI
phase its memory V/F is set to the highest level and the number of SMs is set to 25.
5.5 Running Multiple Applications Concurrently
In this subsection, we discuss how our approach works in a multi-application setting. Most
prior works on multi-kernel execution [1, 31, 32] assign different kernels to different SMs,
assuming a half-and-half partitioning of the SMs among kernels. Following this approach, our
technique monitors the MBU of each individual kernel as well as the overall MBU, in order to
separately find the ideal number of SMs for each kernel. After finding the ideal number of SMs
for a memory-intensive kernel, this approach might decide to assign more SMs to the other
kernel if it is recognized as compute-intensive. We can also decide to shut down some of the
SMs if our scheme detects that none of the concurrently executing kernels will benefit from
the additional SMs. Note that the microarchitecture reported in Section 4.5 already supports
this mechanism. Our memory-side technique, on the other hand, is based on memory-related
statistics; not based
on the number of concurrently executing kernels: we apply DVFS to memory only if we are sure
that none of the concurrently running kernels is memory-intensive. As an example, we picked a
memory-intensive application (GUPS) and a relatively compute-intensive one (JPEG) and ran them
together on a 32-core system, assuming an even partitioning of the SMs across the
applications. We observed that the MBUs of GUPS and JPEG are around 60% and 15%, respectively.
Therefore, our algorithm recognizes GUPS as memory-intensive and, using CSS, finds its ideal
number of SMs, which is about 7 SMs (3% performance loss) instead of 16. Since we tag JPEG as
compute-intensive, we can assign those extra 9 SMs to it, knowing that this will not affect
the overall MBU. As a result, the performance of JPEG improves by 19% without a noticeable
increase in memory traffic.
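The repartitioning decision in this example can be sketched as follows (an illustrative sketch; treating the second kernel as able to absorb the freed SMs whenever it is not memory-intensive is an assumption based on the JPEG example above):

```python
def repartition_sms(mbu_a, mbu_b, ideal_a, total_sms=32):
    """Sketch of the multi-application policy: starting from an even
    split, give kernel A (memory-intensive) only its ideal SM count and
    hand the freed SMs to kernel B if B can use them.
    Returns the (SMs_for_A, SMs_for_B) allocation.
    """
    half = total_sms // 2
    a_is_mi = mbu_a > 0.40   # A saturates memory bandwidth
    b_can_absorb = mbu_b <= 0.40  # B is not memory-intensive itself
    if a_is_mi and b_can_absorb:
        return ideal_a, total_sms - ideal_a
    return half, half        # otherwise keep the even partitioning
```

Plugging in the GUPS/JPEG numbers (MBUs of 60% and 15%, ideal of 7 SMs for GUPS) yields the 7/25 split described above.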
Chapter 6 | Related Work
Theoretically speaking, assigning more SMs to a highly multi-threaded application improves its
performance as long as the memory bandwidth does not saturate. Huang et al. [33] evaluated the
effect of the number of active SMs on energy consumption and argued that having all the SMs
activated is the most energy-efficient configuration. The shortcoming of that study is that it
did not consider any memory-intensive application. For a more accurate analysis, we need to
consider the possible congestion in the interconnection network and the contention in the
last-level cache caused by the enormous number of memory requests issued by the huge number of
concurrently running threads. Along these lines, Guz et al. [2] showed that increasing
parallelism improves performance as long as the memory access latency is not affected
considerably. Suleman et al. [15] proposed feedback-driven threading to control the number of
concurrently running threads in CMPs, in order to balance the overhead of data synchronization
against the level of parallelism.
There have been prior works, in both the CPU and GPU domains, that determine the optimal
number of cores for a particular application/kernel in order to improve performance or reduce
power consumption. Considering that any bottleneck on the memory side (i.e., congestion in the
interconnection network, contention in the L2 cache, or saturation of the DRAM memory
bandwidth) causes frequent SM stalls, we can categorize the prior works as follows.
Modeling-Based. Li and Martinez [34] analytically estimated the optimal number of processors
to achieve the best EDP in CMPs. In the GPU domain, Hong et al. [3] proposed an analytical
model that predicts the optimal number of SMs offline, based on a kernel's memory demand.
These models do not cover the possible congestion in the interconnection network or the
contention in the last-level cache, which might lead to overestimating the number of required
SMs under such circumstances. Besides, in Section 3, we demonstrated that some kernels exhibit
considerably different behaviors over different invocations as dynamic parameters, such as the
kernel's input, change. Modeling techniques fail to capture such variations in run-time
behavior.
Throttling-Based. Throttling is an approach for managing the degree of concurrency in a
multi-threaded application. Rogers et al. [?] and Kayiran et al. [17] both employ a
warp-throttling technique to handle contention in the L1 cache. Thread-throttling techniques
can reduce the pressure on the memory side; consequently, SMs experience fewer and shorter
stalls because of the smaller data-access latencies. In the current work, we adopt a core
shut-down technique that not only adjusts the degree of concurrency but also reduces the core
static leakage power, a major contributor to the total chip power consumption [5]. Besides, we
employ a throttling-based technique (i.e., the pausing technique) during our search process,
and apply the core shut-down once the ideal number of active SMs is determined.
DVFS-Based. Similar to throttling-based techniques, DVFS-based techniques can be used to
reduce the occurrence and length of SM stalls by lowering the frequency of the SMs, which in
turn reduces the number of memory requests generated per unit of time. Although DVFS
techniques are more effective than throttling in reducing SM leakage power, the core shut-down
approach is even more effective in that sense. Lee et al. [4] analyze the impact of DVFS on
the SMs as well as the interconnection network under a fixed power budget. They demonstrate
that wisely distributing the power budget can improve system performance. Leng et al. [5]
employ a DVFS technique for a different purpose. They
propose to reduce the V/F of the SMs only during SM stall times. Note that core-side DVFS
cannot resolve contention in the L2 cache, because cache contention is a function of the
sequence of accesses, not their timing. Overall, DVFS techniques are orthogonal to throttling:
throttling techniques can reduce the occurrence and length of SM stalls, while a parallel DVFS
technique can be applied during the inevitable ones.
Optimal Memory Voltage/Frequency. There are several prior works on the power consumption
of memory systems. One category tries to reshape the memory traffic in order to keep some
channels/ranks/banks in sleep mode [35, 36]. These techniques are not applicable to GPUs
because, for each individual kernel, we would need to detect and migrate the working data-set,
which makes the overhead of data migration prohibitive. From a different perspective, David
et al. [12] employed a DVFS technique to reduce the power consumption of memory. Following
that work, Deng et al. [13] proposed MultiScale, which applies DVFS to memory systems with
multiple controllers. CoScale [?] proposes a coordinated CPU and memory-system DVFS
technique for servers; this coordination avoids the conflicts that can arise when the CPU and
the memory system have separate DVFS controllers. In contrast to these techniques, our
proposed memory-side DVFS operates at kernel granularity.
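A minimal sketch of such a kernel-granularity controller is shown below. The classification threshold, the number of profiling invocations, and the V/F level names are hypothetical, as is the assumption that per-kernel bandwidth utilization is available from hardware counters; a real implementation would live in the driver or on chip.

```python
# Hypothetical kernel-granularity memory DVFS controller: classify each
# kernel from its first few invocations, then reuse that decision for
# every future invocation of the same kernel.

PROFILE_INVOCATIONS = 3     # invocations used for profiling
BW_THRESHOLD = 0.5          # below this utilization => compute-intensive

class MemoryDVFSController:
    def __init__(self):
        self.samples = {}   # kernel name -> observed bandwidth utilizations
        self.decision = {}  # kernel name -> chosen memory V/F level

    def on_kernel_launch(self, kernel, bw_utilization):
        if kernel in self.decision:
            return self.decision[kernel]          # reuse the cached setting
        self.samples.setdefault(kernel, []).append(bw_utilization)
        if len(self.samples[kernel]) >= PROFILE_INVOCATIONS:
            avg = sum(self.samples[kernel]) / len(self.samples[kernel])
            # Compute-intensive kernels tolerate a slower memory system.
            level = "low_vf" if avg < BW_THRESHOLD else "high_vf"
            self.decision[kernel] = level
            return level
        return "high_vf"                          # safe default while profiling
```

Because the decision is cached per kernel, the profiling cost is paid only on the first few invocations, exploiting the observation that a kernel behaves similarly across its many invocations.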
Chapter 7 |
Conclusions
In this paper, we showed that allocating all the SMs to memory-intensive kernels is not neces-
sarily energy-efficient, nor does it achieve the highest performance. Based on this
observation, we proposed a technique that dynamically determines the ideal number of active
SMs for each kernel and power-gates the remaining SMs. This technique significantly reduces
both dynamic and static power consumption: over a wide variety of GPGPU applications,
it reduced energy consumption by up to 36%, and by about 20% on average. On the other hand,
memory does not necessarily need to be set to its fastest configuration for compute-intensive
kernels. Therefore, we can apply DVFS to memory whenever a compute-intensive kernel
is running. We showed that, by scaling the memory voltage for compute-intensive kernels, we can
reduce memory leakage power by 70% on average, without affecting overall system performance.
Bibliography
[1] NVIDIA, "Kepler: NVIDIA's Next Generation CUDA Compute Architecture." URL http://www.nvidia.com/kepler
[2] Guz, Z., E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. Weiser (2009) "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," Computer Architecture Letters.
[3] Hong, S. and H. Kim (2010) "An Integrated GPU Power and Performance Model," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10.
[4] Lee, J., V. Sathisha, M. Schulte, K. Compton, and N. S. Kim (2011) "Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling," in Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on.
[5] Leng, J., T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi (2013) "GPUWattch: Enabling Energy Optimizations in GPGPUs," SIGARCH Comput. Archit. News.
[6] NVIDIA, "Fermi: NVIDIA's Next Generation CUDA Compute Architecture." URL http://www.nvidia.com/fermi
[7] Lee, S. Y. and C. J. Wu (2014) "Characterizing the Latency Hiding Ability of GPUs," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014, poster.
[8] Abdel-Majeed, M. and M. Annavaram (2013) "Warped Register File: A Power Efficient Register File for GPGPUs," in Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), HPCA '13.
[9] Yuan, G. L., A. Bakhoda, and T. M. Aamodt (2009) "Complexity Effective Memory Access Scheduling for Many-core Accelerator Architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, ACM.
[10] Singh, I., A. Shriraman, W. W. Fung, M. O'Connor, and T. M. Aamodt (2013) "Cache Coherence for GPU Architectures," in High Performance Computer Architecture (HPCA 2013), 2013 IEEE 19th International Symposium on.
[11] Hechtman, B. A., S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood (2014) "QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs," in High Performance Computer Architecture (HPCA 2014), 2014 IEEE 20th International Symposium on.
[12] David, H., C. Fallin, E. Gorbatov, U. R. Hanebutte, and O. Mutlu (2011) "Memory Power Management via Dynamic Voltage/Frequency Scaling," in Proceedings of the 8th ACM International Conference on Autonomic Computing, ICAC '11, ACM.
[13] Deng, Q., D. Meisner, A. Bhattacharjee, T. F. Wenisch, and R. Bianchini (2012) "MultiScale: Memory System DVFS with Multiple Memory Controllers," in Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED '12, ACM.
[14] Gebhart, M., D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron (2011) "Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, ACM.
[15] Suleman, M. A., M. K. Qureshi, and Y. N. Patt (2008) "Feedback-driven Threading: Power-efficient and High-performance Execution of Multi-threaded Workloads on CMPs," SIGARCH Comput. Archit. News.
[16] Williams, S., A. Waterman, and D. Patterson (2009) "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, pp. 65–76.
[17] Kayiran, O., A. Jog, M. T. Kandemir, and C. R. Das (2013) "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in PACT.
[18] Intel, "Next Generation Intel Microarchitecture." URL http://www.intel.com/technology/architecture-silicon/next-gen/
[19] Kim, N. S., T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir, and V. Narayanan (2003) "Leakage Current: Moore's Law Meets Static Power," Computer.
[20] Predictive Technology Model. URL http://ptm.asu.edu
[21] Rogers, T. G., M. O'Connor, and T. M. Aamodt (2012) "Cache-Conscious Wavefront Scheduling," in MICRO.
[22] Fung, W., I. Sham, G. Yuan, and T. Aamodt (2007) "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in MICRO.
[23] Rixner, S., W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens (2000) "Memory Access Scheduling," in ISCA.
[24] Bakhoda, A., G. Yuan, W. Fung, H. Wong, and T. Aamodt (2009) "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on.
[25] NVIDIA (2011), "CUDA C/C++ SDK Code Samples." URL http://developer.nvidia.com/cuda-cc-sdk-code-samples
[26] Che, S., M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron (2009) "Rodinia: A Benchmark Suite for Heterogeneous Computing," in IISWC.
[27] Stratton, J. A., C. Rodrigues, I. J. Sung, N. Obeid, L. W. Chang, N. Anssari, G. D. Liu, and W. W. Hwu (2012) Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing, Tech. Rep. IMPACT-12-01, University of Illinois at Urbana-Champaign.
[28] He, B., W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang (2008) "Mars: A MapReduce Framework on Graphics Processors," in PACT.
[29] Danalis, A., G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter (2010) "The Scalable Heterogeneous Computing (SHOC) Benchmark Suite," in GPGPU.
[30] Burtscher, M., R. Nasre, and K. Pingali (2012) "A Quantitative Study of Irregular Programs on GPUs," in IISWC.
[31] Adriaens, J., K. Compton, N. S. Kim, and M. Schulte (2012) "The Case for GPGPU Spatial Multitasking," in High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on.
[32] Pai, S., M. J. Thazhuthaveetil, and R. Govindarajan (2013) "Improving GPGPU Concurrency with Elastic Kernels," SIGPLAN Not.
[33] Huang, S., S. Xiao, and W. Feng (2009) "On the Energy Efficiency of Graphics Processing Units for Scientific Computing," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, IEEE Computer Society.
[34] Li, J. and J. F. Martínez (2005) "Power-performance Considerations of Parallel Computing on Chip Multiprocessors," ACM Trans. Archit. Code Optim.
[35] Huang, H., K. G. Shin, C. Lefurgy, and T. Keller (2005) "Improving Energy Efficiency by Making DRAM Less Randomly Accessed," in Proceedings of the 2005 International Symposium on Low Power Electronics and Design, ISLPED '05, ACM.
[36] Luz, V. D. L., M. Kandemir, and I. Kolcu (2002) "Automatic Data Migration for Reducing Energy Consumption in Multi-bank Memory Systems," in Proceedings of the 39th Annual Design Automation Conference, DAC '02, ACM.