Fine-Grained Resource Sharing for Concurrent GPGPU Kernels

Chris Gregg, Jonathan Dorn, Kim Hazelwood, Kevin Skadron
Department of Computer Science, University of Virginia
PO Box 400740, Charlottesville, VA 22904
ABSTRACT
General purpose GPU (GPGPU) programming frameworks such as OpenCL and CUDA allow running individual computation kernels sequentially on a device. However, in some cases it is possible to utilize device resources more efficiently by running kernels concurrently. This raises questions about load balancing and resource allocation that have not previously warranted investigation. For example, what kernel characteristics impact the optimal partitioning of resources among concurrently executing kernels? Current frameworks do not provide the ability to easily run kernels concurrently with fine-grained and dynamic control over resource partitioning. We present KernelMerge, a kernel scheduler that runs two OpenCL kernels concurrently on one device. KernelMerge furnishes a number of settings that can be used to survey concurrent or single-kernel configurations, and to investigate how kernels interact and influence each other, or themselves. KernelMerge provides a concurrent kernel scheduler compatible with the OpenCL API.
We present an argument for the benefits of running kernels concurrently. We demonstrate how to use KernelMerge to increase throughput for two kernels that efficiently use device resources when run concurrently, and we establish that some kernels show worse performance when running concurrently. We also outline a method for using KernelMerge to investigate how concurrent kernels influence each other, with the goal of predicting runtimes for concurrent execution from individual kernel runtimes. Finally, we suggest GPU architectural changes that would improve such concurrent schedulers in the future.
1. INTRODUCTION
General purpose GPU (GPGPU) computing capitalizes on the massively parallel resources of a modern GPU to enable high compute throughput. Programming languages such as CUDA and OpenCL have provided application developers with a familiar environment with which to exploit GPU capabilities. However, there remain barriers to mass deployment of GPGPU technology, one of which is the inability to easily run multiple applications concurrently. In addition, because of architectural differences among devices, an optimization for one GPU may result in worse utilization on other GPUs. Even highly performance-optimized kernels may have memory- or compute-bound behavior, leaving some resources underutilized. We propose that running another kernel concurrently with this first kernel can take advantage of these underutilized resources, improving overall system throughput when many kernels are available to run. Concurrent execution provides the additional benefit of exploiting the entire device, even when executing less than fully-optimized kernels. This eases the programmer burden, since optimization requires time that may be profitably reallocated.
Current GPGPU frameworks do not have the ability to run concurrent kernels with fine-grained control over the resources available to each kernel, and we argue that such a feature will be increasingly necessary as the popularity of GPU computing grows. In this paper, we present our argument and describe KernelMerge, a prototype kernel scheduler that has this ability. We demonstrate a case study using KernelMerge to find the optimal resource allocation for two kernels running simultaneously, and we show that a naive approach to this static assignment can lead to a degradation in performance over running the kernels sequentially.
GPGPU code can provide tremendous speed increases over multicore CPUs for computational work. However, the nature of GPU architectural designs can lead to underutilization of resources: GPU kernels can either saturate the computational units on the device (the applications are compute bound), or they can saturate the available memory bandwidth (they are bandwidth bound). Application developers are cautioned to attempt to balance these two concerns so that a GPU kernel performs most efficiently, but this optimization can be tedious, and in many cases algorithms cannot be rewritten to balance computation and memory bandwidth equally [1, 5]. Figure 1 shows an example of a compute bound kernel. To generate the figure, both the GPU memory clock and the GPU processor clock were independently adjusted for a Merge Sort benchmark kernel, which uses shared memory to decrease the number of global memory writes, leading to compute-bound behavior. As the figure shows, changing the memory clock frequency does not change the runtime, but changing the processor clock frequency does. In other words, the kernel is compute bound. Figure 2 shows an example of a bandwidth bound kernel, Vector Add. In contrast to Figure 1, the bandwidth bound kernel is affected by the memory clock frequency adjustments, and not the processor clock adjustments.
Figure 1: Merge Sort: a compute bound example. The horizontal axis shows a sweep of the GPU memory clock, which does not change the kernel running time. The depth axis shows a sweep of the GPU processor clock; as the frequency is lowered, the running time is significantly increased.

When a compute bound or bandwidth bound kernel runs alone on the GPU, it may not fully utilize the resources that the GPU has to offer. In the cases where it does not realize the ideal utilization, our goal is to change the ratio of compute instructions to memory instructions so that the device is fully utilized. Specifically, when a compute bound kernel runs, the memory controllers may be idle while waiting for the computational units, and when a bandwidth bound kernel is running, ALUs and other computational circuits may be forced to wait for memory loads or stores. The premise for this work is that it is possible to increase utilization of the entire device by judiciously running multiple concurrent kernels that will together utilize the available resources. Using KernelMerge we saw up to 18% speedup for two concurrent kernels over the time to run the same kernels sequentially.
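The clock-sweep methodology above lends itself to a simple automatic classification: lower each clock in turn and compare the relative runtime changes. The sketch below is a hypothetical illustration (the function name and the sample numbers are ours, loosely inspired by Figures 1 and 2), not part of KernelMerge:

```cpp
#include <string>

// Classify a kernel as compute- or bandwidth-bound from a clock sweep:
// if runtime tracks the processor clock, the kernel is compute bound;
// if it tracks the memory clock, it is bandwidth bound. Each sensitivity
// is the relative runtime change for the same relative clock reduction.
std::string classifyBoundedness(double t_base,
                                double t_low_core,   // runtime, lowered core clock
                                double t_low_mem) {  // runtime, lowered memory clock
    double core_sens = (t_low_core - t_base) / t_base;
    double mem_sens  = (t_low_mem  - t_base) / t_base;
    return core_sens > mem_sens ? "compute bound" : "bandwidth bound";
}
```

A Merge-Sort-like kernel whose runtime grows from 52 ms to 64 ms under a core-clock reduction but is unmoved by a memory-clock reduction would classify as compute bound; a Vector-Add-like kernel shows the opposite pattern.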
2. KERNELMERGE
2.1 Overview
The goal of KernelMerge is to enable concurrent kernel execution on GPUs, and to provide application developers and GPGPU researchers a set of knobs to change scheduler parameters. Consequently, KernelMerge can be used to tune performance when running multiple kernels simultaneously (e.g., to maximize kernel throughput), and as a platform for research into concurrent kernel scheduling. For example, KernelMerge can set the total number of workgroups running on a GPU, and can set a fixed number of workgroups per kernel in order to dedicate a percentage of resources to one or both kernels. By sweeping through these values and capturing runtimes (which KernelMerge provides), developers can determine the best parameters for their kernels, and researchers can see trends that may allow runtime predictions.
KernelMerge is a software scheduler, built in C++ and OpenCL for the Linux platform, that meets the need for simultaneous kernel execution on GPUs without overly burdening application developers with significant code rewrites. The C++ code "hijacks" the OpenCL runtime, providing its own implementation of many OpenCL API calls. For example, a custom implementation of clEnqueueNDRangeKernel, which dispatches kernels to the device, uses the current scheduling algorithm to determine when to invoke the kernel instead of the default OpenCL FIFO queue. The calling code is unchanged, using the same API call with the same arguments but gaining the new functionality. In fact, most host code can take advantage of the scheduler with no changes save a single additional #include declaration.
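The hijacking technique can be sketched with mock types. The real KernelMerge interposes on the actual OpenCL entry points; `MockKernel`, `scheduledEnqueue`, and the recording queue below are illustrative stand-ins for how a header-level redefinition lets unchanged host call sites reach a scheduler:

```cpp
#include <cstddef>
#include <vector>

// Minimal mock of API interposition (real OpenCL types omitted).
struct MockKernel { int id; };

static std::vector<MockKernel> g_pending;  // kernels awaiting a merged launch

// Scheduler's replacement: record the launch rather than dispatching it,
// so two recorded kernels can later be merged into one real launch.
int scheduledEnqueue(MockKernel k, std::size_t /*globalSize*/) {
    g_pending.push_back(k);
    return 0;  // analogue of CL_SUCCESS
}

// The interposition: after including the scheduler header, the
// preprocessor rewrites the host program's unchanged call.
#define clEnqueueNDRangeKernel scheduledEnqueue

int hostCode() {
    MockKernel k{42};
    return clEnqueueNDRangeKernel(k, 1024);  // unchanged host call site
}
```

In the real system, interposition can also be done at link time; the macro approach shown here matches the paper's "single additional #include" model.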
Figure 2: Vector Add: a memory bandwidth bound example. The horizontal axis shows a sweep of the GPU memory clock; as the frequency is lowered, the kernel running time increases. The depth axis shows a sweep of the GPU processor clock, which does not affect the running time.

KernelMerge includes additional features that are of interest to GPGPU developers and researchers. It has the ability to determine the maximum number of workgroups of two kernels that will run at once on a device, which can influence the partitioning scheme that the scheduler uses, and can also be useful when analyzing performance or algorithm design.
We envision KernelMerge evolving from its current form as mainly a research tool into a robust scheduler that could be implemented by an operating system for high-throughput GPGPU scheduling. We developed KernelMerge to allow us to run independent OpenCL kernels simultaneously on a single OpenCL-capable device. KernelMerge also has the ability to run single kernels by themselves, which allows measurement of scheduler overhead and also enables using the scheduler settings to investigate how a single kernel behaves.
At the highest level, KernelMerge combines two invocations of the OpenCL enqueueNDRangeKernel (one from each application wishing to run a kernel) into a single kernel launch that consists of the scheduler itself. In other words, as far as the GPU is concerned, there is a single kernel running on the device. Before launching the scheduler kernel, the scheduler passes the individual kernel parameters, along with buffer, workgroup, and workitem information, to the device, which the scheduler uses to run the individual kernels on the device.
The scheduler creates a parametrically defined number of "scheduler workgroups" that the device sees as a single set of workgroups for the kernel. The scheduler then dispatches a workgroup (a "kernel workgroup") from the individual kernels to each scheduler workgroup. There are two main scheduling algorithms that we have developed, work stealing and partitioning by kernel, described in Section 2.2. The scheduler workgroups execute their assigned kernels as if they were the assigned kernel workgroup, using a technique we call "spoofing."
OpenCL kernels rely on a set of IDs to determine their role in a kernel algorithm. Because a kernel workgroup can get launched into any scheduler workgroup, the real global, workgroup, and workitem IDs will not correspond to the original IDs that the kernel developer had designated.
Function              Returns
get_global_id()       The unique global work-item ID
get_global_size()     The number of global work-items
get_group_id()        The work-group ID
get_local_id()        The unique local work-item ID
get_local_size()      The number of local work-items
get_num_groups()      The number of work-groups that will execute a kernel

Table 1: Workitem and workgroup identification available to a running OpenCL kernel.
Therefore, when the scheduler dispatches the kernel workgroups into the scheduler workgroups, it must "spoof" the IDs so that the kernel workgroup sees the correct values. One benefit of the spoofing is that the technique transparently allows us to concurrently schedule a kernel that uses a linear workgroup with another that uses a two-dimensional workgroup. This spoofing is responsible for a significant portion of the scheduler overhead, as discussed in Section 2.4. Table 1 shows the ID information that the scheduler spoofs.
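For a one-dimensional kernel, the spoofed IDs reduce to simple arithmetic on the assigned kernel workgroup index. The helper below is a hypothetical host-side model of that remapping (KernelMerge performs it in device code, and additionally handles two-dimensional workgroups):

```cpp
#include <cstddef>

// Sketch of workgroup ID "spoofing": a kernel workgroup may execute
// inside any scheduler workgroup, so the hardware IDs must be remapped
// to the values the kernel author expects. Names are illustrative.
struct SpoofedIds {
    std::size_t group_id;   // spoofed get_group_id(0)
    std::size_t local_id;   // spoofed get_local_id(0)
    std::size_t global_id;  // spoofed get_global_id(0)
};

// assigned_group: the kernel workgroup this scheduler workgroup is
// currently impersonating; hw_local_id: the real hardware local ID.
SpoofedIds spoof(std::size_t assigned_group, std::size_t hw_local_id,
                 std::size_t kernel_local_size) {
    SpoofedIds ids;
    ids.group_id  = assigned_group;
    ids.local_id  = hw_local_id;  // local IDs pass through unchanged
    ids.global_id = assigned_group * kernel_local_size + hw_local_id;
    return ids;
}
```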
Once the spoofing is established, the scheduler "dispatches" the workgroups, which carry out the work they have been assigned. The scheduler continues to partition out the work until no work remains, at which point the scheduler (and thus the kernel that the device sees) completes.
The scheduler collects runtime data for the scheduler kernel, and can report the runtime for the overall kernel. Unfortunately, because the GPU sees the scheduler as a single monolithic kernel, the scheduler cannot obtain runtimes for the individual kernels.
2.2 Scheduling Algorithms
There are two different scheduling algorithms that KernelMerge utilizes. The first is a round-robin work-stealing algorithm that assigns work from each kernel on a first-come, first-served basis. The second algorithm assigns a fixed percentage of workgroups to each individual kernel.

Figure 3 shows a graphical representation of the work-stealing algorithm on the left, and the static assignment of a fixed percentage of scheduler workgroups to each kernel on the right. For the work-stealing algorithm, kernel workgroups are assigned alternating locations in a queue. If there are more workgroups of one kernel, the extra workgroups occupy consecutive positions at the end of the queue. Scheduler workgroups are assigned the kernel workgroups in a first-come, first-served fashion, so that at the beginning half the scheduler workgroups are running one kernel and the other half are running the other kernel. As each scheduler workgroup finishes, it is reassigned more work from the head of the queue. With this algorithm, the allocation of resources to each kernel is not fixed, as it depends on the run time of each kernel. Once one kernel has finished running, all of the scheduler workgroups will run the remaining kernel workgroups until it, too, completes.
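The interleaved queue that drives the work-stealing algorithm can be modeled in a few lines. This is an illustrative host-side simulation, assuming two kernels numbered 0 and 1; the real queue lives in GPU global memory:

```cpp
#include <deque>

// Build the round-robin work queue: workgroups of the two kernels
// alternate, and if one kernel has more workgroups, its extras occupy
// consecutive positions at the tail. Idle scheduler workgroups pop
// entries from the head first-come, first-served.
struct QueueItem { int kernel; int workgroup; };

std::deque<QueueItem> buildWorkQueue(int wgsA, int wgsB) {
    std::deque<QueueItem> q;
    int n = wgsA > wgsB ? wgsA : wgsB;
    for (int i = 0; i < n; ++i) {
        if (i < wgsA) q.push_back({0, i});  // kernel A's i-th workgroup
        if (i < wgsB) q.push_back({1, i});  // kernel B's i-th workgroup
    }
    return q;
}
```

For example, three workgroups of kernel A and two of kernel B interleave as A0, B0, A1, B1, A2, so A's surplus workgroup sits at the tail.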
In the partitioning algorithm, when a scheduler workgroup finishes, it is only reassigned work from one kernel. This ensures that the percentage of device resources assigned to each kernel remains fixed. This algorithm can run as few as one workgroup of the first kernel while dedicating all remaining device resources to workgroups of the second kernel (and vice versa). This ability is one of the most powerful settings available to KernelMerge, allowing high-precision tuning of the device resources assigned to each kernel. For example, if a highly memory-bound kernel is run concurrently with a highly compute-bound kernel, the percentage of device resources can be adjusted so that the memory-bound kernel is assigned just enough resources to barely saturate the memory system, and the rest of the resources can be dedicated to running the compute-bound kernel, maximizing throughput for both kernels.

Figure 3: Work-stealing vs. partitioning. The black boxes represent one kernel and the white boxes represent the other. In (b), one-fourth of the available scheduler workgroups are assigned the short-running kernel (1, 3, 5), and the remaining scheduler workgroups are assigned to the long-running kernel. As scheduler workgroups finish, work is replaced from their assigned kernel.
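Under the partitioning algorithm, each kernel's completion time follows directly from its fixed quota of scheduler workgroups. A minimal model, assuming uniform workgroup durations (a simplifying assumption of ours, not the paper's):

```cpp
// Completion-time model for a fixed partition: a kernel holding `quota`
// of the scheduler workgroups drains its `wgs` kernel workgroups in
// waves, each wave taking `wgTime` (uniform durations assumed).
double partitionTime(int wgs, int quota, double wgTime) {
    int waves = (wgs + quota - 1) / quota;  // ceiling division
    return waves * wgTime;
}
```

Nine workgroups on a quota of three finish in three waves; ten workgroups need a fourth, nearly empty wave, which is one reason sweeping the quota (Section 3) can expose non-obvious sweet spots.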
2.3 Results
Figure 4 shows a comparison between pairs of kernels that were run sequentially and were also run concurrently using KernelMerge. The figure shows that 39% of the paired kernels showed a speedup when run concurrently. However, naively scheduling kernels to run together can be highly detrimental; in some cases running two kernels concurrently leads to over a 3x slowdown. These results provide further motivation for the use of KernelMerge as a research tool: being able to predict when two kernels will positively or negatively affect the concurrent runtime is non-trivial. The settings provided within KernelMerge have the potential to help identify general conditions that result in either beneficial or adverse concurrent kernel pairings, which can in turn lead to the ability to predict runtimes and therefore make informed decisions about whether or not to run kernels concurrently. We defer the results of the partitioning algorithm to Section 3, where we show a case study using this algorithm.
2.4 Limitations and Runtime Overhead
As with any runtime scheduler, there are overhead costs to running kernels with KernelMerge. Figure 5 shows the overhead of running a set of kernels individually and then from within KernelMerge. The runtime overhead can be broken down into the following components:
1. Processing a data structure in global memory to select which requested kernel workgroup to run, and to ensure that each such workgroup is run exactly once.
Figure 4: Merged runtimes compared to sequential runtimes. In this figure, pairs of OpenCL kernels were run concurrently using KernelMerge and compared to the time to run the pair sequentially. 39% of the pairs ran faster than they would run sequentially. However, one pair ran over 3x slower than the sequential runs, indicating that naively scheduling kernels together can lead to poor performance. Not all combinations of the seven benchmark kernels were run together because some applications only contain a single kernel invocation.

2. Computing the translation from the actual workitem IDs to the IDs needed for the selected workgroup (i.e., spoofing).
3. Additional control flow necessary because the compute unit ID (i.e., the ID differentiating the groups of cores) is not accessible through the API.
The complexity of the scheduler, together with the fact that every potentially schedulable kernel is compiled into a single kernel, increases the kernel's register usage. This reduces the number of workgroups that may simultaneously occupy the device, even if only a single kernel is actually scheduled, due to limitations on the number of available registers on the device. Similarly, the number of concurrent workgroups is limited by the available shared memory; if any schedulable kernel uses shared memory, all schedulable kernels are limited by it, even if no kernel that uses shared memory is actually scheduled.
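The register and shared-memory pressure described above can be expressed as a standard occupancy bound: the merged kernel's per-compute-unit workgroup count is the minimum of what each resource allows, and the merged kernel pays the maximum demand of every schedulable kernel. The resource figures used below are hypothetical:

```cpp
// Per-compute-unit workgroup occupancy, bounded by registers and
// shared memory. A zero shared-memory demand leaves only the register
// bound in effect. Hardware limits here are illustrative, not those of
// any particular GPU.
int maxWorkgroupsPerCU(int regsPerCU, int regsPerWg,
                       int sharedPerCU, int sharedPerWg) {
    int byRegs = regsPerCU / regsPerWg;
    int byShared = (sharedPerWg > 0) ? sharedPerCU / sharedPerWg : byRegs;
    return byRegs < byShared ? byRegs : byShared;
}
```

With 32K registers and 48 KB of shared memory per compute unit, a merged kernel needing 4K registers and 16 KB of shared memory per workgroup is shared-memory limited to 3 workgroups, even if only the register-light kernel is actually scheduled.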
One final limitation of the KernelMerge approach affects the use of barriers within kernels. If two kernels that use different total numbers of workitems are scheduled, the unused workitems must be disabled. If the kernel with the smaller workgroup uses a barrier, the disabled workitems will never execute the barrier, causing the entire workgroup to deadlock.
2.5 Suggested GPU Architecture Changes
Many of the limitations discussed above arise because today's GPUs are built on the assumption that a single kernel runs at a time. We recommend that future GPU architectures explicitly support concurrent execution of independent kernels at workgroup granularity. We believe that implementing the scheduler in software remains the most appropriate and flexible method, and that the concerns from the previous subsection can be relieved through relatively simple hardware changes. First, allowing the registers used to determine workgroup IDs to be written would eliminate the need for spoofing. Second, there are resources that need to be allocated per workgroup, such as local memory and barriers. For instance, because workgroups of different kernels may contain different numbers of workitems, the hardware must allow the scheduler to specify to the barriers how many workitems to expect. Third, exposing the compute unit ID through the API would eliminate extra control flow and would expand the space of possible scheduling algorithms. For instance, a scheduler could assign all instances of a kernel to a particular compute unit or set of compute units.

Figure 5: KernelMerge overhead. Overhead does not affect all kernels equally. Short-running kernels that contain a large number of kernel workgroups are adversely affected (e.g., Bitonic Sort).
Currently, KernelMerge incurs significant overhead because of the necessity of working around hardware limitations to multiple kernel execution. The minimal hardware changes we suggest would enable easier creation of schedulers that implement the multiple kernel execution we have presented in this paper.
3. CASE STUDY: DETERMINING OPTIMAL WORKGROUP BALANCE
We now show how KernelMerge can be used to find the optimal workgroup partitioning for concurrent kernels. Section 2.2 described the KernelMerge scheduling algorithm whereby each kernel is given a set number of workgroups out of the total workgroups available (determined by a helper application) for the kernel pair. In order to find the minimal runtime for each kernel pair, we wrote a script that sweeps through all workgroup combinations for each pair and runs the scheduler accordingly. We also ran each kernel pair sequentially and calculated the total sequential time.
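The sweep script's core logic amounts to finding the partition with the minimum measured runtime. A sketch, with a caller-supplied timing table standing in for real merged launches (the indexing convention is ours):

```cpp
#include <cstddef>
#include <vector>

// Given the measured runtime for each workgroup split (index i holds
// the runtime when kernel A gets i+1 of the available scheduler
// workgroups and kernel B gets the rest), return the workgroup count
// for kernel A that minimizes the merged runtime.
std::size_t bestPartition(const std::vector<double>& runtimes) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < runtimes.size(); ++i)
        if (runtimes[i] < runtimes[best]) best = i;
    return best + 1;  // convert index back to a workgroup count
}
```

A "bathtub"-shaped timing table, such as the ones in Figure 6, yields a sweet spot in the interior of the sweep rather than at either extreme.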
Figure 6 shows a graph of six kernel pairs and the minimum times for each pair. For the benchmarks we used, the device (an NVIDIA GTX 460) was able to run 28 workgroups at one time. The first 27 bars on the graph of each kernel pair show the combined runtime for each workgroup combination, and the last bar on each graph shows the sequential runtime. The darker column shows the minimum runtime, and thus the optimal runtime for the pair. Many of the graphs show a "bathtub" shape, indicating that both kernels suffer significantly when their computation resources (i.e., scheduler workgroups) are restricted. In the cases that show an increasing trend towards the right, the first kernel is able to execute efficiently even with limited computation resources. In one case (bitonicSort_kmeans_kernel_c), the sequential runtime is less than all of the scheduler runtime combinations, indicating that running the kernels concurrently is detrimental.

Figure 6: Runtimes for concurrent kernels with the partitioning algorithm (panels: bitonicSort_kmeans_kernel_c, bitonicSort_kmeans_swap, fastWalshTransform_kmeans_swap, floydWarshallPass_kmeans_swap, kmeans_swap_simpleConvolution, NearestNeighbor_kmeans_swap). The dark column with the arrow indicates the minimum value. The final column indicates the combined sequential runtime for each merged kernel pair.
This case study demonstrates that KernelMerge can be used to determine the optimal workgroup partitioning for concurrent kernels, and also that naively scheduling kernels together can produce undesirable results.
Figure 7 compares the results of the work-stealing algorithm to the sweet spots found by the partitioning algorithm. As the figure shows, the work-stealing algorithm approaches the optimal results found by the partitioning algorithm without needing to perform an exhaustive search.
4. RELATED WORK
As of version 3.0, NVIDIA's CUDA has a limited ability to run concurrent kernels on devices with compute capability 2.0 or above (e.g., Fermi GPUs) [1]. The kernels must be run in different CUDA streams that are explicitly set up by the programmer, and the CUDA runtime does not guarantee that kernels will run concurrently. Furthermore, the number of workgroups per kernel that run concurrently is not definable by the programmer. KernelMerge allows workgroup counts to be set in the scheduler, and although there are significant overheads at this point in time, minimal hardware changes would substantially increase the efficiency of KernelMerge.
Guevara et al. [4] discuss a CUDA scheduler that runs orthogonally bound kernels to increase throughput for both kernels, although all kernels must be merged and analyzed by hand. Wang et al. [8] demonstrate the benefits of "funneling" CUDA kernels into a single context so that they can be run concurrently. They do not discuss how the kernels interfere with each other on a GPU, although their benchmarks are most likely memory bound, as they perform relatively few calculations per memory access.

Figure 7: Results of the work-stealing scheduler compared to the sweet-spot partitioning scheduler result. In most cases, the work-stealing algorithm approaches the partitioning sweet spot.
Chen et al. [3] propose a dynamic load balancing scheme for single and multiple GPUs. They implement a task queue with functionality similar to KernelMerge's that launches thread blocks (the CUDA equivalent of OpenCL workgroups) independently of the normal CUDA launch behavior. They show a slight increase in performance on a single GPU using their scheduler, due to its ability to begin execution of new thread blocks while other thread blocks are still running longer kernels. They do not discuss load balancing as it relates to the type of bounding exhibited by individual kernels. Wang et al. [7] propose a source-code-level kernel merging scheme in the interest of better device utilization for decreased energy usage. Their tests are limited to GPGPU-Sim [2], but their experiments indicate significant energy reduction for fused kernels over serial execution.
Li et al. [6] propose a GPU virtualization layer that allows multiple microprocessors to share GPU resources. Their virtualization layer implements concurrent kernels in CUDA, and they discuss computation and memory boundedness (which they call "compute-intensive" and "I/O-intensive") as it relates to running kernels concurrently. It is not clear how they classified the individual kernels in order to determine how they were bound.
5. CONCLUSION
We have presented a justification for our argument that running concurrent kernels on a GPU can be beneficial for overall compute throughput. We also showed that naively scheduling kernel pairs together can have a negative effect, and therefore it is necessary to dynamically schedule kernels together. We demonstrated KernelMerge, an OpenCL concurrent kernel scheduler that has a number of dials important for concurrent kernel research. We have also demonstrated that a work-stealing algorithm for merging two kernels can approximate optimal workgroup partitioning. For future work, we plan to investigate how to characterize resource consumption for individual kernels in order to make dynamic scheduling decisions.
6. REFERENCES
[1] CUDA Programming Guide 3.1. http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf, June 2010.
[2] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Performance Analysis of Systems and Software (ISPASS), 2009 IEEE International Symposium on, pages 163-174, April 2009.
[3] L. Chen, O. Villa, S. Krishnamoorthy, and G. Gao. Dynamic load balancing on single- and multi-GPU systems. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1-12, April 2010.
[4] M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron. Enabling task parallelism in the CUDA scheduler. In Programming Models and Emerging Architectures Workshop (Parallel Architectures and Compilation Techniques Conference), 2009.
[5] D. Lee, I. Dinov, B. Dong, B. Gutman, I. Yanovsky, and A. Toga. CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms. Computer Methods and Programs in Biomedicine, pages 1-13, December 2010.
[6] T. Li, V. Narayana, E. El-Araby, and T. El-Ghazawi. GPU resource sharing and virtualization on high performance computing systems. In Parallel Processing (ICPP), 2011 International Conference on, pages 733-742, September 2011.
[7] G. Wang, Y. Lin, and W. Yi. Kernel fusion: An effective method for better power efficiency on multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on & Int'l Conference on Cyber, Physical and Social Computing (CPSCom), pages 344-350, December 2010.
[8] L. Wang, M. Huang, and T. El-Ghazawi. Exploiting concurrent kernel execution on graphic processing units. In High Performance Computing and Simulation (HPCS), 2011 International Conference on, pages 24-32, July 2011.