
Fine-Grained Resource Sharing for Concurrent GPGPU Kernels

Chris Gregg, Jonathan Dorn, Kim Hazelwood, Kevin Skadron
Department of Computer Science
University of Virginia
PO Box 400740
Charlottesville, VA 22904

ABSTRACT

General purpose GPU (GPGPU) programming frameworks such as OpenCL and CUDA allow running individual computation kernels sequentially on a device. However, in some cases it is possible to utilize device resources more efficiently by running kernels concurrently. This raises questions about load balancing and resource allocation that have not previously warranted investigation. For example, what kernel characteristics impact the optimal partitioning of resources among concurrently executing kernels? Current frameworks do not provide the ability to easily run kernels concurrently with fine-grained and dynamic control over resource partitioning. We present KernelMerge, a kernel scheduler that runs two OpenCL kernels concurrently on one device. KernelMerge furnishes a number of settings that can be used to survey concurrent or single kernel configurations, and to investigate how kernels interact and influence each other, or themselves. KernelMerge provides a concurrent kernel scheduler compatible with the OpenCL API.

We present an argument for the benefits of running kernels concurrently. We demonstrate how to use KernelMerge to increase throughput for two kernels that efficiently use device resources when run concurrently, and we establish that some kernels show worse performance when running concurrently. We also outline a method for using KernelMerge to investigate how concurrent kernels influence each other, with the goal of predicting runtimes for concurrent execution from individual kernel runtimes. Finally, we suggest GPU architectural changes that would improve such concurrent schedulers in the future.

1. INTRODUCTION

General purpose GPU (GPGPU) computing capitalizes on the massively parallel resources of a modern GPU to enable high compute throughput. Programming languages such as CUDA and OpenCL have provided application developers with a familiar environment with which to exploit GPU capabilities. However, there remain barriers to mass deployment of GPGPU technology, one of which is the inability to easily run multiple applications concurrently. In addition, because of architectural differences among devices, an optimization for one GPU may result in worse utilization on other GPUs. Even highly performance-optimized kernels may have memory- or compute-bound behavior, leaving some resources underutilized. We propose that running another kernel concurrently with this first kernel can take advantage of these underutilized resources, improving overall system throughput when many kernels are available to run. Concurrent execution provides the additional benefit of exploiting the entire device, even when executing less than fully-optimized kernels. This eases the programmer burden, since optimization requires time that may be profitably reallocated.

Current GPGPU frameworks do not have the ability to run concurrent kernels with fine-grained control over resources available to each kernel, and we argue that such a feature will be increasingly necessary as the popularity of GPU computing grows. In this paper, we present our argument and describe KernelMerge, a prototype kernel scheduler that has this ability. We demonstrate a case study using KernelMerge to find the optimal resource allocation for two kernels running simultaneously, and we show that a naive approach to this static assignment can lead to a degradation in performance over running the kernels sequentially.

GPGPU code can provide tremendous speed increases over multicore CPUs for computational work. However, the nature of GPU architectural designs can lead to underutilization of resources: GPU kernels can either saturate the computational units on the device (the applications are compute bound), or they can saturate the available memory bandwidth (they are bandwidth bound). Application developers are cautioned to attempt to balance these two concerns so that a GPU kernel performs most efficiently, but this optimization can be tedious and in many cases algorithms cannot be rewritten to balance computation and memory bandwidth equally [1, 5]. Figure 1 shows an example of a compute bound kernel. To generate the figure, both the GPU memory clock and the GPU processor clock were independently adjusted for a Merge Sort benchmark kernel, which uses shared memory to decrease the number of global memory writes, leading to compute-bound behavior. As the figure shows, changing the memory clock frequency does not change the runtime, but changing the processor clock frequency does. In other words, the kernel is compute bound. Figure 2 shows an example of a bandwidth bound kernel, Vector Add. In contrast to Figure 1, the bandwidth bound kernel is affected by the memory clock frequency adjustments, and not the processor clock adjustments.
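To make the distinction concrete, the following OpenCL C sketch (for illustration only; it is not the benchmark source used for the figures) contrasts the two behaviors: a vector add performs one addition per three global memory accesses, so its runtime tracks memory bandwidth, while a kernel that performs many arithmetic operations per element loaded tracks the processor clock instead.

    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c)
    {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];               /* one flop per three global accesses: bandwidth bound */
    }

    __kernel void poly_eval(__global float *x)
    {
        int i = get_global_id(0);
        float v = x[i];
        for (int k = 0; k < 512; ++k)     /* many flops per element loaded: compute bound */
            v = v * 1.0001f + 0.5f;
        x[i] = v;
    }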

When a compute bound or bandwidth bound kernel runs alone on the GPU, it may not fully utilize the resources that the GPU has to offer. In the cases where it does not realize the ideal utilization, our goal is to change the ratio of compute instructions to memory instructions so that the device is fully utilized. Specifically, when a compute bound kernel runs, the memory controllers may be idle while waiting for the computational units, and when a bandwidth bound kernel is running, ALUs and other computational circuits may be forced to wait for memory loads or stores. The premise for this work is that it is possible to increase utilization of the entire device by judiciously running multiple concurrent kernels that will together utilize the available resources. Using KernelMerge we saw up to 18% speedup for two concurrent kernels over the time to run the same kernels sequentially.

Figure 1: Merge Sort: a compute bound example. The horizontal axis shows a sweep of the GPU memory clock, which does not change the kernel running time. The depth axis shows a sweep of the GPU processor clock, and as the frequency is lowered, the running time is significantly increased.

2. KERNELMERGE

2.1 Overview

The goal of KernelMerge is to enable concurrent kernel execution on GPUs, and to provide application developers and GPGPU researchers a set of knobs to change scheduler parameters. Consequently, KernelMerge can be used to tune performance when running multiple kernels simultaneously (e.g., to maximize kernel throughput), and as a platform for research into concurrent kernel scheduling. For example, KernelMerge can set the total number of workgroups running on a GPU, and can set a fixed number of workgroups per kernel in order to dedicate a percentage of resources to one or both kernels. By sweeping through these values and capturing runtimes (which KernelMerge provides), developers can determine the best parameters for their kernels, or researchers can see trends that may allow runtime predictions.

KernelMerge is a software scheduler, built in C++ and OpenCL for the Linux platform, that meets the need for simultaneous kernel execution on GPUs without overly burdening application developers with significant code rewrites. The C++ code "hijacks" the OpenCL runtime, providing its own implementation of many OpenCL API calls. For example, a custom implementation of clEnqueueNDRangeKernel, which dispatches kernels to the device, uses the current scheduling algorithm to determine when to invoke the kernel instead of the default OpenCL FIFO queue. The calling code is unchanged, using the same API call with the same arguments but gaining the new functionality. In fact, most host code can take advantage of the scheduler with no changes save a single additional #include directive.
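As an illustration of the interposition, host code written against the standard OpenCL API keeps its calls unchanged; only the extra header is new. The header name kernelmerge.h is assumed here for illustration and may not match the actual file name.

    #include "kernelmerge.h"   /* hypothetical header name; enables the interposed scheduler */
    #include <CL/cl.h>

    /* Unchanged OpenCL host code: the interposed clEnqueueNDRangeKernel hands the
       launch to the KernelMerge scheduler instead of the default FIFO queue. */
    void launch(cl_command_queue queue, cl_kernel kernel, size_t global, size_t local)
    {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
        clFinish(queue);
    }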

KernelMerge includes additional features that are of interest to GPGPU developers and researchers. It has the ability to determine the maximum number of workgroups of two kernels that will run at once on a device, which can influence the partitioning scheme that the scheduler uses, and can also be useful when analyzing performance or algorithm design.

Figure 2: Vector Add: a memory bandwidth bound example. The horizontal axis shows a sweep of the GPU memory clock, and as the frequency is lowered, the kernel running time increases. The depth axis shows a sweep of the GPU processor clock, which does not affect the running time.

We envision KernelMerge evolving from its current form as mainly a research tool into a robust scheduler that could be implemented by an operating system for high-throughput GPGPU scheduling. We developed KernelMerge to allow us to run independent OpenCL kernels simultaneously on a single OpenCL-capable device. KernelMerge also has the ability to run single kernels by themselves, which allows measurement of scheduler overhead and also enables using the scheduler settings to investigate how a single kernel behaves.

At the highest level, KernelMerge combines two invocations of the OpenCL enqueueNDRangeKernel call (one from each application wishing to run a kernel) into a single kernel launch that consists of the scheduler itself. In other words, as far as the GPU is concerned, there is a single kernel running on the device. Before launching the scheduler kernel, the scheduler passes the individual kernel parameters, along with buffer, workgroup, and workitem information, to the device; the scheduler uses this information to run the individual kernels on the device.

The scheduler creates a parametrically defined number of "scheduler workgroups," which the device sees as a single set of workgroups for the kernel. The scheduler then dispatches a workgroup (a "kernel workgroup") from the individual kernels to each scheduler workgroup. There are two main scheduling algorithms that we have developed: work stealing and partitioning by kernel, described in Section 2.2. The scheduler workgroups execute their assigned kernels as if they were the assigned kernel workgroup, using a technique we call "spoofing."

OpenCL kernels rely on a set of IDs to determine their role in a kernel algorithm. Because a kernel workgroup can get launched into any scheduler workgroup, the real global, workgroup, and workitem IDs will not correspond to the original IDs that the kernel developer had designated. Therefore, when the scheduler dispatches the kernel workgroups into the scheduler workgroups, it must "spoof" the IDs so that the kernel workgroup sees the correct values. One benefit to the spoofing is that the technique transparently allows us to concurrently schedule a kernel that uses a linear workgroup with another that uses a two-dimensional workgroup. This spoofing is responsible for a significant portion of the scheduler overhead, as discussed in Section 2.4. Table 1 shows the ID information that the scheduler spoofs.

    Function             Returns
    get_global_id()      The unique global work-item ID
    get_global_size()    The number of global work-items
    get_group_id()       The work-group ID
    get_local_id()       The unique local work-item ID
    get_local_size()     The number of local work-items
    get_num_groups()     The number of work-groups that will execute a kernel

Table 1: Workitem and workgroup identification available to a running OpenCL kernel.
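As a rough sketch of the idea (all names here, such as km_spoof_t and km_group_id, are hypothetical, and the actual implementation may differ), each dispatched kernel workgroup can carry a small descriptor holding the IDs the original kernel expects, and the kernel's ID queries are redirected to helpers that read from that descriptor:

    typedef struct {
        int group_id;      /* spoofed work-group ID for this dispatch        */
        int local_size;    /* work-items per group in the original kernel    */
        int num_groups;    /* total work-groups in the original kernel       */
    } km_spoof_t;

    int km_group_id(km_spoof_t s)    { return s.group_id; }
    int km_num_groups(km_spoof_t s)  { return s.num_groups; }
    int km_global_id(km_spoof_t s)   { return s.group_id * s.local_size + (int)get_local_id(0); }
    int km_global_size(km_spoof_t s) { return s.num_groups * s.local_size; }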

Once the spoofing is established, the scheduler "dispatches" the workgroups, which carry out the work they have been assigned. The scheduler continues to partition out the work until no work remains, at which point the scheduler (and thus the kernel that the device sees) completes.

The scheduler collects runtime data for the scheduler kernel, and can report on the runtime for the overall kernel. Unfortunately, because the GPU sees the scheduler as a single monolithic kernel, the scheduler cannot obtain runtimes for the individual kernels.

2.2 Scheduling Algorithms

There are two different scheduling algorithms that KernelMerge utilizes. The first is a round-robin work-stealing algorithm that assigns work from each kernel on a first-come, first-served basis. The second algorithm assigns a fixed percentage of workgroups to each individual kernel.

Figure 3 shows a graphical representation of the work-stealing algorithm on the left, and the static assignment of a fixed percentage of scheduler workgroups to each kernel on the right. For the work-stealing algorithm, kernel workgroups are assigned alternating locations in a queue. If there are more workgroups of one kernel, the extra workgroups occupy consecutive positions at the end of the queue. Scheduler workgroups are assigned the kernel workgroups in a first-come, first-served fashion, so that at the beginning half the scheduler workgroups are running one kernel and the other half are running the other kernel. As each scheduler workgroup finishes, it is re-assigned more work from the head of the queue. With this algorithm, the allocation of resources to each kernel is not fixed, as it depends on the run time of each kernel. Once one kernel has finished running, all of the scheduler workgroups will be running the remaining kernel workgroups until it, too, completes.
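A minimal OpenCL C sketch of this dispatch loop conveys the mechanism (hypothetical names; the real scheduler also performs ID spoofing and handles two-dimensional workgroups): each scheduler workgroup repeatedly claims the next kernel workgroup from the shared queue with an atomic counter until the queue is drained.

    __kernel void km_scheduler(__global const int *km_queue,  /* interleaved kernel-workgroup entries */
                               __global int *km_next,         /* index of the next unclaimed entry    */
                               const int km_total)            /* total kernel workgroups in the queue */
    {
        __local int entry;
        while (1) {
            /* One workitem per scheduler workgroup claims the next queue entry. */
            if (get_local_id(0) == 0)
                entry = atomic_inc(km_next);
            barrier(CLK_LOCAL_MEM_FENCE);
            if (entry >= km_total)
                return;                        /* queue drained: the merged kernel completes */
            /* Decode km_queue[entry] to find which kernel and which of its workgroups to run,
               spoof the IDs, and execute that workgroup's work, e.g.:
               km_run_kernel_workgroup(km_queue[entry]);   (hypothetical helper)              */
            barrier(CLK_LOCAL_MEM_FENCE);      /* keep the workgroup together before the next claim */
        }
    }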

In the partitioning algorithm, when a scheduler workgroup finishes, it is only reassigned work from one kernel. This ensures that the percentage of device resources assigned to each kernel remains fixed. This algorithm can run as few as one workgroup of the first kernel while dedicating all remaining device resources to workgroups of the second kernel (and vice-versa). This ability is one of the most powerful settings available to KernelMerge, allowing high precision tuning of the device resources assigned to each kernel. For example, if a highly memory-bound kernel is run concurrently with a highly compute-bound kernel, the percentage of device resources can be adjusted so that the memory-bound kernel is assigned just enough resources that it barely saturates the memory system, and the rest of the resources can be dedicated to running the compute bound kernel, maximizing throughput for both kernels.

Figure 3: Work-stealing vs. partitioning, shown as (a) the work-stealing algorithm and (b) the partitioning algorithm. The black boxes represent one kernel and the white boxes represent the other. In (b), one-fourth of the available scheduler workgroups are assigned the short-running kernel (1, 3, 5), and the remaining scheduler workgroups are assigned to the long-running kernel. As scheduler workgroups finish, work is replaced from their assigned kernel.

2.3 Results

Figure 4 shows a comparison between pairs of kernels that were run sequentially and were also run concurrently using KernelMerge. The figure shows that 39% of the paired kernels showed a speedup when run concurrently. However, naively scheduling kernels to run together can be highly detrimental; in some cases running two kernels concurrently leads to over 3x slowdown. These results provide further motivation for the use of KernelMerge as a research tool: being able to predict when two kernels will positively or negatively affect the concurrent runtime is non-trivial. The settings provided within KernelMerge have the potential to help identify general conditions that result in either beneficial or adverse concurrent kernel pairings, which can in turn lead to the ability to predict runtimes and therefore make informed decisions about whether or not to run kernels concurrently. We defer the results of the partitioning algorithm to Section 3, where we show a case study using this algorithm.

2.4 Limitations and Runtime Overhead

As with any runtime scheduler, there are overhead costs to running kernels with KernelMerge. Figure 5 shows the overhead of running a set of kernels individually and then from within KernelMerge. The runtime overhead can be broken down into the following components:

1. Processing a data structure in global memory to select which requested kernel workgroup to run, and to ensure that each such workgroup is run exactly once.

2. Computing the translation from the actual workitem IDs to the IDs needed for the selected workgroup (i.e., spoofing).

3. Additional control flow necessary because the compute unit ID (i.e., the ID differentiating the groups of cores) is not accessible through the API.

Figure 4: Merged runtimes compared to sequential runtimes. In this figure, pairs of OpenCL kernels were run concurrently using KernelMerge and compared to the time to run the pair sequentially. The horizontal axis shows speedup over sequential execution. 39% of the pairs ran faster than they would run sequentially. However, one pair ran over 3x slower than the sequential runs, indicating that naively scheduling kernels together could lead to poor performance. Not all combinations of the seven benchmark kernels run together because some applications only contain a single kernel invocation.

The complexity of the scheduler, along with the fact that every potentially schedulable kernel is included in a single kernel, combines to increase the kernel's register usage. This reduces the number of workgroups that may simultaneously occupy the device, even if only a single kernel is actually scheduled, due to limitations on the number of available registers on the device. Similarly, the number of concurrent workgroups is limited by the available shared memory; if any schedulable kernel uses shared memory, all schedulable kernels are limited by it, even if no kernel that uses shared memory is actually scheduled.

One final limitation of the KernelMerge approach affects the use of barriers within kernels. If two kernels that use different total numbers of workitems are scheduled, the unused workitems must be disabled. If the kernel with the smaller workgroup uses a barrier, the disabled workitems will never execute the barrier, causing the entire workgroup to deadlock.
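The hazard can be seen in a minimal OpenCL C sketch (illustrative, not taken from the benchmarks): a kernel that is safe to merge must execute its barriers unconditionally, because guarding the barrier with the active-workitem test would leave the disabled workitems out of the barrier and hang the workgroup.

    __kernel void copy_with_barrier(__global float *buf, const int active)
    {
        int lid = get_local_id(0);
        __local float tile[256];
        if (lid < active)                      /* only the kernel's "real" workitems load       */
            tile[lid] = buf[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);          /* must be executed by every workitem, including */
                                               /* those the scheduler has disabled              */
        if (lid < active)                      /* and only the "real" workitems store           */
            buf[get_global_id(0)] = tile[lid];
    }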

Figure 5: KernelMerge overhead. Overhead does not affect all kernels the same. Short running kernels that contain a large number of kernel workgroups are adversely affected (e.g., Bitonic Sort). Each bar is split into kernel runtime and scheduling overhead; the kernels shown are bitonicSort, fastWalshTransform, floydWarshallPass, kmeans_kernel_c, kmeans_swap, NearestNeighbor, and simpleConvolution.

2.5 Suggested GPU Architecture Changes

Many of the limitations discussed above arise because today's GPUs are built on the assumption that a single kernel runs at a time. We recommend that future GPU architectures explicitly support concurrent execution of independent kernels at workgroup granularity. We believe that implementing the scheduler in software remains the most appropriate and flexible method and that the concerns from the previous subsection can be relieved through relatively simple hardware changes. First, allowing the registers used to determine workgroup IDs to be written would eliminate the need for spoofing. Second, there are resources that need to be allocated per workgroup, such as local memory and barriers. For instance, because workgroups of different kernels may contain different numbers of workitems, the hardware must allow the scheduler to specify to the barriers how many workitems to expect. Third, exposing the compute unit ID through the API would eliminate extra control flow and would expand the space of possible scheduling algorithms. For instance, a scheduler could assign all instances of a kernel to a particular compute unit or set of compute units.

Currently, KernelMerge incurs significant overhead because of the necessity of working around hardware limitations to multiple kernel execution. The minimal hardware changes we suggest would enable easier creation of schedulers that implement the multiple kernel execution we have presented in this paper.

3. CASE STUDY: DETERMINING OPTIMAL WORKGROUP BALANCE

We now show how KernelMerge can be used to find the optimal workgroup partitioning for concurrent kernels. Section 2.2 described the KernelMerge scheduling algorithm whereby each kernel is given a set number of workgroups, out of the total workgroups available for the kernel pair (determined by a helper application). In order to find the minimal runtime for each kernel pair, we wrote a script that sweeps through all workgroup combinations for each pair and runs the scheduler accordingly. We also ran each kernel pair sequentially and calculated the total sequential time.
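The sweep itself is simple; the following C sketch shows its shape. The calls km_set_partition and km_run_pair are hypothetical stand-ins for the scheduler's configuration interface and are shown only to illustrate the procedure.

    #include <stdio.h>

    /* Hypothetical KernelMerge interface, declared only to illustrate the sweep shape. */
    void   km_set_partition(int wg_first, int wg_second);
    double km_run_pair(void);                 /* run the merged pair, return runtime in ms */

    int main(void)
    {
        const int total_wgs = 28;             /* concurrent workgroups the GTX 460 held (Section 3) */
        for (int wg_first = 1; wg_first < total_wgs; ++wg_first) {
            km_set_partition(wg_first, total_wgs - wg_first);
            printf("%2d/%2d workgroups: %.2f ms\n",
                   wg_first, total_wgs - wg_first, km_run_pair());
        }
        return 0;
    }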

Figure 6 shows a graph of six kernel pairs and the minimum times for each pair. For the benchmarks we used, the device (an NVIDIA GTX 460) was able to run 28 workgroups at one time. The first 27 bars on the graph of each kernel pair show the combined runtime for the workgroup combinations, and the last bar on each graph shows the sequential runtime. The darker column shows the minimum runtime, and thus the optimal runtime for the pair. Many of the graphs show a "bathtub" shape, indicating that both kernels suffer significantly when their computation resources (i.e., scheduler workgroups) are restricted. In the cases that show an increasing trend toward the right, the first kernel is able to execute efficiently, even with limited computation resources. In one case (bitonicSort_kmeans_kernel_c), the sequential runtime is less than all of the scheduler runtime combinations, indicating that running the kernels concurrently is detrimental.

Figure 6: Runtimes for concurrent kernels with the partitioning algorithm. The dark column with the arrow indicates the minimum value. The final column indicates the combined sequential runtime for each merged kernel pair. The six pairs shown are bitonicSort_kmeans_kernel_c, bitonicSort_kmeans_swap, fastWalshTransform_kmeans_swap, floydWarshallPass_kmeans_swap, kmeans_swap_simpleConvolution, and NearestNeighbor_kmeans_swap.

This case study demonstrates that KernelMerge can be used to determine the optimal workgroup partitioning for concurrent kernels, and it also demonstrates that naively scheduling kernels together can produce undesirable results.

Figure 7 compares the results of the work-stealing algorithm to the sweet spots found by the partitioning algorithm. As the figure shows, the work-stealing algorithm approaches the optimal results found by the partitioning algorithm without needing to perform an exhaustive search.

4. RELATED WORK

As of version 3.0, NVIDIA's CUDA has a limited ability to run concurrent kernels on devices with compute capability 2.0 or above (e.g., Fermi GPUs) [1]. The kernels must be run in different CUDA streams that are explicitly set up by the programmer, and the CUDA runtime does not guarantee that kernels will run concurrently. Furthermore, the number of workgroups per kernel that are run concurrently is not definable by the programmer. KernelMerge allows workgroup counts to be set in the scheduler, and although there are substantial overheads at this point in time, minimal hardware changes would significantly increase the efficiency of KernelMerge.

Guevara et al. [4] discuss a CUDA scheduler that runs orthogonally bound kernels to increase throughput for both kernels, although all kernels must be merged and analyzed by hand. Wang et al. [8] demonstrate the benefits of "funneling" CUDA kernels into a single context so that they can be run concurrently. They do not discuss how the kernels interfere with each other on a GPU, although their benchmarks are most likely memory bound, as they perform relatively few calculations per memory access.

Figure 7: Results of the work-stealing scheduler compared to the sweet spot partitioning scheduler result. In most cases, the work-stealing algorithm approaches the partitioning sweet spot.

Chen et al. [3] propose a dynamic load balancing scheme for single and multiple GPUs. They implement a task queue with functionality similar to KernelMerge that launches thread blocks (the CUDA equivalent of OpenCL workgroups) independently of the normal CUDA launch behavior. They show a slight increase in performance on a single GPU using their scheduler, due to their scheduler's ability to begin execution of new thread blocks while other thread blocks are still running longer kernels. They do not discuss load balancing as it relates to the type of bounding exhibited by individual kernels. Wang et al. [7] propose a source-code level kernel merging scheme in the interest of better device utilization for decreased energy usage. Their tests are limited to GPGPU-Sim [2], but their experiments indicate significant energy reduction for fused kernels over serial execution.

Li et al. [6] propose a GPU virtualization layer that allows multiple microprocessors to share GPU resources. Their virtualization layer implements concurrent kernels in CUDA, and they discuss computation and memory boundedness (which they call "compute intensive" and "I/O-intensive") as it relates to running kernels concurrently. It is not clear how they classified the individual kernels in order to determine how they were bound.

5. CONCLUSION

We have presented a justification for our argument that running concurrent kernels on a GPU can be beneficial for overall compute throughput. We also showed that naively scheduling kernel pairs together can have a negative effect, and therefore it is necessary to dynamically schedule kernels together. We demonstrated KernelMerge, an OpenCL concurrent kernel scheduler that has a number of dials important for concurrent kernel research. We have also demonstrated that a work-stealing algorithm for merging two kernels can approximate optimal workgroup partitioning. For future work, we plan to investigate how to characterize resource consumption for individual kernels in order to make dynamic scheduling decisions.


6. REFERENCES

[1] CUDA Programming Guide 3.1. http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf, June 2010.

[2] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 163-174, April 2009.

[3] L. Chen, O. Villa, S. Krishnamoorthy, and G. Gao. Dynamic load balancing on single- and multi-GPU systems. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1-12, April 2010.

[4] M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron. Enabling task parallelism in the CUDA scheduler. In Programming Models and Emerging Architectures Workshop (Parallel Architectures and Compilation Techniques Conference), 2009.

[5] D. Lee, I. Dinov, B. Dong, B. Gutman, I. Yanovsky, and A. Toga. CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms. Computer Methods and Programs in Biomedicine, pages 1-13, December 2010.

[6] T. Li, V. Narayana, E. El-Araby, and T. El-Ghazawi. GPU resource sharing and virtualization on high performance computing systems. In International Conference on Parallel Processing (ICPP), pages 733-742, September 2011.

[7] G. Wang, Y. Lin, and W. Yi. Kernel fusion: An effective method for better power efficiency on multithreaded GPU. In IEEE/ACM International Conference on Green Computing and Communications (GreenCom) and International Conference on Cyber, Physical and Social Computing (CPSCom), pages 344-350, December 2010.

[8] L. Wang, M. Huang, and T. El-Ghazawi. Exploiting concurrent kernel execution on graphic processing units. In International Conference on High Performance Computing and Simulation (HPCS), pages 24-32, July 2011.