


Original Article

The International Journal of High Performance Computing Applications 1–16. © The Author(s) 2014. Reprints and permissions: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/1094342014526907. hpc.sagepub.com

Analyzing power efficiency of optimization techniques and algorithm design methods for applications on heterogeneous platforms

Yash Ukidave, Amir Kavyan Ziabari, Perhaad Mistry, Gunar Schirner and David Kaeli

Abstract
Graphics processing units (GPUs) have become widely accepted as the computing platform of choice in many high performance computing domains. Programming standards such as OpenCL are used to leverage the inherent parallelism offered by GPUs. Source code optimizations such as loop unrolling and tiling, when targeted to heterogeneous applications, have reported large gains in performance. However, given the power consumption of GPUs, platforms can exhaust their power budgets quickly. Better solutions are needed to effectively exploit the power efficiency available on heterogeneous systems. In this work, we evaluate the power/performance efficiency of different optimizations used on heterogeneous applications. We analyze the power/performance trade-off by evaluating the energy consumption of the optimizations. We compare the performance of different optimization techniques on four different fast Fourier transform implementations. Our study covers discrete GPUs, shared memory GPUs (APUs) and low power system-on-chip (SoC) devices, and includes hardware from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Kepler), Intel (Ivy Bridge) and Qualcomm (Snapdragon S4) as test platforms. The study identifies the architectural and algorithmic factors which can most impact power consumption. We explore a range of application optimizations which show an increase in power consumption by 27%, but result in more than a 1.8× speedup in performance. We observe up to an 18% reduction in power consumption due to reduced kernel calls across FFT implementations. We also observe an 11% variation in energy consumption among different optimizations. We highlight how different optimizations can improve the execution performance of a heterogeneous application, but also impact the power efficiency of the application. More importantly, we demonstrate that different algorithms implementing the same fundamental function (FFT) can perform with vast differences based on the target hardware and associated application design.

Keywords
OpenCL, fast Fourier transform, power, optimizations, GPUs, system-on-chip

1 Introduction

GPUs are now ubiquitous accelerator devices due to their impressive processing potential for a wide class of applications. GPU architectures are used to attain performance speedups for applications from domains such as scientific computing, biomedical imaging, global positioning systems and signal processing (Pharr and Fernando, 2005; Deschizeaux and Blanc, 2007; Hassanieh et al., 2012). Even though large performance gains have been reported for computations on GPUs (Owens et al., 2008), power budgets for the GPUs used in these applications can be very high (Suda and Ren, 2009). This is evident from the fact that the

maximum TDP (thermal design power) of the latest generation of GPUs from Nvidia (Kepler GTX 680) is similar to that of the last generation (Fermi GTX 580) at 200 W (NVIDIA, 2012) (i.e. TDP is not getting better).

A wide range of optimization techniques for heterogeneous applications have been explored in prior research (Ryoo et al., 2008; Liu et al., 2009).

Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA

Corresponding author:

Yash Ukidave, 140 The Fenway, Room 320, Boston, MA 02215, USA.

Email: [email protected]


Optimizations including loop-unrolling and local memory utilization are commonly applied to achieve performance speedups on GPUs (Ryoo et al., 2008). Architectural factors such as ALU and bandwidth utilization have been studied to improve execution efficiency of applications on GPUs (Jang et al., 2011). Data transformation techniques have also been used to improve throughput of applications on GPUs (Jang et al., 2009).

Power consumption of applications on heterogeneous devices such as GPUs has been studied previously (Ye et al., 2000; Suda and Ren, 2009). Statistical power models for GPUs based on vendor-provided performance counters have been developed (Nagasaka et al., 2010). The power consumption of a variety of microbenchmarks running on GPUs has also been studied previously (Kasichayanula et al., 2012).

The interplay between software optimizations and power consumption, particularly when targeting GPUs, is a subject of growing interest. Power and energy management for applications executing on enterprise-level clusters has become a necessity. Scientific applications with long-running persistent computations, such as weather models and mathematical models, can adopt different energy saving mechanisms. Such savings in energy can lead to the expansion of compute capability within a given energy budget. This paper examines the power/performance of different optimization techniques used to tune applications running on heterogeneous platforms.

In our previous work, we evaluated the power efficiency of different fast Fourier transform (FFT) algorithms on different heterogeneous architectures (Ukidave et al., 2013). We have extended our prior work by considering the effects of both source code and compiler-based optimizations that can be applied, and their impact on power/performance.

In this paper, we examine the power consumption of an application on a GPU when considering different classes of optimizations. Our evaluation is carried out on a number of discrete GPUs, shared memory APUs and low power system-on-chip (SoC) devices. We demonstrate the power/performance efficiency of the SoC devices for heterogeneous computations using our evaluations. The impact of different architectural features of each GPU on power consumption is also discussed. Equipped with this knowledge, GPU platform developers can consider trade-offs between performance and power-efficiency. The contributions of this paper can be summarized as follows.

• We provide a study of the effect of algorithm design methods on the power/performance efficiency of heterogeneous applications on a wide range of devices.

• We describe a specific set of performance optimizations appropriate for heterogeneous applications targeted for GPUs and implemented in OpenCL.

• We evaluate each proposed optimization in terms of execution performance and power consumption across four different FFT implementations on discrete GPUs, shared memory APUs and low power SoCs.

• We utilize our results to classify power and performance optimizations separately for heterogeneous applications.

• We establish the importance of low power SoCs for the future of high-performance and heterogeneous computing through our evaluations.

We describe the OpenCL programming framework and the need for performing power/performance evaluation of each optimization in Section 2. We define and list different implementations, along with detailed descriptions of each optimization considered, in Section 3. The experimental setup and platforms used for evaluation are described in Section 4. Next, we evaluate the power efficiency of application design methods and optimizations on different platforms and input datasets in Section 5. We summarize our findings in Section 6. We review related work, conclude the paper and describe future work in Sections 7, 8 and 9, respectively.

2 Background

2.1 Heterogeneous computing using OpenCL

Heterogeneous computing has gained popularity due to the potential performance benefits available on discrete GPUs and shared memory APUs. OpenCL is an open standard maintained by the Khronos Group, and has received the backing of a number of major graphics hardware vendors. An OpenCL program has host code that executes on a single thread of a CPU. The host code is responsible for setting up data and scheduling execution on OpenCL compute devices such as a GPU and/or CPU cores. OpenCL creates a context of work which establishes a unique command queue for each of the OpenCL devices to be used (per GPU or CPU). This architecture is shown in Figure 1. The code that executes on a compute device is called a kernel. A more detailed discussion on implementing heterogeneous

Figure 1. OpenCL host device architecture.



applications in OpenCL can be found in Gaster et al. (2011).

2.2 Power performance of optimization techniques

Power consumption of a GPU has been studied (Rofouei et al., 2008; Tsoi and Luk, 2011) to understand and develop power efficient GPU designs. At peak performance, the maximum power consumed by a CPU and the latest high-end GPU is 100 W and 200 W, respectively. This highlights the need to carefully consider power/performance trade-offs of heterogeneous applications when designing efficient applications targeting GPUs.

Performance of heterogeneous applications has been studied by researchers using different optimization techniques (Jang et al., 2011). Large gains in performance are reported due to the use of optimizations such as loop unrolling and coalesced memory accesses (Ryoo et al., 2008). These evaluations do not account for the power consumption of the application due to such optimizations. The study of optimizations from a power-aware perspective is the focus of this paper.

We evaluate different implementations of the FFT in terms of power and performance. The FFT is an efficient version of the discrete Fourier transform (DFT) algorithm. It reduces the time complexity of the DFT to O(N log N) from O(N^2), where N is the number of input data points. A high performance FFT is key to meeting performance demands in applications such as image processing, data compression, biomedical imaging and global positioning systems (Rivera et al., 2007; Stone et al., 2008; Hassanieh et al., 2012). Different implementations of the FFT algorithm have primarily been developed to enhance the algorithm runtime, without considering power usage.
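The complexity gap above can be made concrete with a toy sketch (Python, for illustration only; the counts are asymptotic models that ignore constant factors, and the function names are ours, not from the paper):

```python
import math

def dft_op_count(n):
    # direct DFT: on the order of N^2 operations
    return n * n

def fft_op_count(n):
    # FFT: on the order of N * log2(N) operations
    return n * math.log2(n)

# For a 1M-point (2^20) transform, the FFT model needs roughly
# 2^20 / 20 ~ 52,000x fewer operations than the direct DFT model.
ratio = dft_op_count(2**20) / fft_op_count(2**20)
```

At the 1M- and 2M-point input sizes used later in the paper's evaluation, this gap is the reason only FFT-class algorithms are practical.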

3 The fast Fourier transform

3.1 Definition and classification of the FFT

A discrete function f(x) of finite length can be described as a set of potentially infinite sinusoidal functions. This representation is known as the frequency domain representation of the function, F(u). The relation between these two functions is represented by the DFT:

F{f(x)} = F(u) = (1/N) Σ_{x=0}^{N−1} f(x) W_N^{ux}    (1)

where W_N = e^{−j2π/N}, and is called the Twiddle Factor, and N is the number of input data points.

Equation (1) defines the forward DFT computation over a finite time signal f(x).
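As a concrete reference point, Equation (1) can be computed directly. The sketch below is in Python purely for illustration (the paper's implementations are OpenCL kernels), and follows the paper's convention of placing the 1/N factor on the forward transform; the function name is ours:

```python
import cmath

def dft(f):
    """Direct forward DFT per Equation (1):
    F(u) = (1/N) * sum_{x=0}^{N-1} f(x) * W_N^(u*x),
    with Twiddle Factor W_N = exp(-j*2*pi/N)."""
    N = len(f)
    W = cmath.exp(-2j * cmath.pi / N)  # Twiddle Factor W_N
    return [sum(f[x] * W ** (u * x) for x in range(N)) / N
            for u in range(N)]
```

For example, `dft([1, 0, 0, 0])` yields 0.25 in every frequency bin, reflecting the 1/N normalization. This O(N^2) form is exactly what the FFT algorithms below avoid.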

The FFT refers to a class of algorithms which use a divide-and-conquer technique to efficiently compute the DFT of the input signal. For a one-dimensional (1D) array of input data of size N, the FFT algorithm breaks the array into a number of equal-sized sub-arrays and performs computation on these sub-arrays. The process of splitting up the input sequence is called decimation, which can be done in two different ways. These two approaches cover all variations of FFT algorithms.

1. Decimation in time (DIT): Splits the N-point data sequence into two N/2-point data sequences, f1(N) and f2(N), for even and odd numbered input samples, respectively.

2. Decimation in frequency (DIF): Splits the N-point data into two sequences, for the first N/2 and last N/2 data points, respectively.

Decimation exploits both the symmetry and periodicity of the complex exponential W_N = e^{−j2π/N} (Twiddle Factor) (Duhamel and Vetterli, 1990).

3.2 FFT implementations

The different FFT implementations are developed using OpenCL for execution on heterogeneous devices. The FFT implementations are compared based on their design and execution characteristics when mapped to heterogeneous devices, as shown in Table 1.

• Multi-radix-single-call FFT:

The implementation of the multi-radix-single-call (MR-SC) FFT is based on the FFT implementation presented in the AMD OpenCL SDK (AMD, n.d.). MR-SC is a single kernel call based DIF-FFT implementation. Twiddle factors are stored as preprocessed constants for efficient computation. The application uses local memory and global memory in its basic design.

• Stockham FFT:

The Stockham FFT is a variation of the DIT-FFT algorithm which does not use bit-reversal. Bit-reversal

Table 1. Comparison of FFT implementations according to their design attributes.




in DIT algorithms is used to reorder data for each stage of compute, to perform a correct transform. Similar to the MR-SC FFT implementation, the Stockham FFT is a single-kernel implementation that utilizes the global memory of the device.
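The self-sorting property can be sketched with one common recursive formulation of the Stockham algorithm (Python for illustration; this is our own minimal sketch under the unnormalized convention, not the paper's OpenCL kernel, and all names are ours). Ping-ponging between two buffers leaves every stage's output in natural order, so no bit-reversal pass is needed:

```python
import cmath

def _stockham(n, s, eo, x, y):
    # One butterfly stage of a self-sorting Stockham FFT.
    # n: current sub-transform length, s: stride,
    # eo: True if the live data should finish in y rather than x.
    if n == 1:
        if eo:
            for q in range(s):
                y[q] = x[q]
        return
    m = n // 2
    for p in range(m):
        wp = cmath.exp(-2j * cmath.pi * p / n)  # twiddle for this butterfly
        for q in range(s):
            a = x[q + s * p]
            b = x[q + s * (p + m)]
            # outputs land in naturally sorted order -- no bit-reversal pass
            y[q + s * (2 * p)] = a + b
            y[q + s * (2 * p + 1)] = (a - b) * wp
    _stockham(m, 2 * s, not eo, y, x)  # ping-pong the two buffers

def stockham_fft(data):
    x = [complex(v) for v in data]
    y = [0j] * len(x)
    _stockham(len(x), 1, False, x, y)
    return x  # with eo=False at the top level, the result ends up in x
```

On a GPU, avoiding the bit-reversal pass removes one scattered global-memory access pattern per transform, at the cost of the second (ping-pong) buffer.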

• Multi-radix-multi-call FFT:

The multi-radix-multi-call (MR-MC) FFT is a multi-radix implementation of DIF-FFT. Such implementations use combinations of kernels of different radix size. We consider an example with an input size of 1024 data points. The number of stages required to compute the FFT is log2(N), i.e. 10. Using the MR-MC algorithm, radix-8 kernels are launched three times and a radix-2 kernel is launched once, to complete the 10-stage computation. A traditional radix-2 implementation would take 10 kernel launches to accomplish the computation.

The MR-MC implementation uses radix-8, radix-4 and radix-2 kernels for computation. The MR-MC FFT structure uses global memory to store input data on the heterogeneous device. Twiddle factors are calculated during kernel execution.
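The kernel-launch counting above can be sketched as a greedy largest-radix-first decomposition (Python for illustration; this is our own model of the stage planning, not code from the paper, and the function name is hypothetical):

```python
def stage_plan(n, radices=(8, 4, 2)):
    """Greedily decompose an N-point FFT into mixed-radix stages,
    preferring the largest available radix first."""
    plan = []
    while n > 1:
        for r in radices:
            if n % r == 0:
                plan.append(r)
                n //= r
                break
        else:
            raise ValueError("N must factor into the available radices")
    return plan
```

For N = 1024 this yields three radix-8 stages plus one radix-2 stage (4 kernel launches), matching the example above, versus 10 launches for a pure radix-2 plan (`stage_plan(1024, radices=(2,))`).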

• Apple OpenCL FFT:

Apple’s OpenCL FFT is based on the open-source DIF-FFT implementation provided by Apple Inc. (Apple, n.d.). The implementation is similar in structure to the MR-MC FFT, but also utilizes a radix-16 computation kernel. The Twiddle Factor computation is done during kernel execution. Apple’s FFT uses the global memory of the heterogeneous device.

In our FFT computations, we have consistently used only single-precision complex inputs, since some of our tested platforms provide only IEEE single-precision compliance (e.g. the Intel Ivy Bridge GPU). We have utilized generic implementations of our algorithms, which can easily be extended to double-precision for compliant devices.

3.3 Optimization techniques used on FFTimplementations

Unoptimized code, and the resulting power consumption, degrades the power/performance of the application when run on a GPU. To address this issue, we explore different optimization techniques to improve FFT efficiency.

The different optimization techniques can be characterized as four sets: S0, S1, S2 and S3. Details of the optimizations included in each set are described as follows:

• Set S0: Set S0 does not perform any optimizations on the applications. The implementations are unmodified and provide an "out-of-the-box" level of analysis.

• Set S1: Traditional loop-unrolling mechanisms used on GPU applications are used in Set S1 (Ryoo et al., 2008). Unrolled loops in kernels for GPGPU computation also result in more compute per work item, contributing to higher utilization of resources. The Stockham FFT implements loops in its structure and hence was modified for loop unrolling. The main compute loop was unrolled four levels. The kernel input data points are modified to produce coalesced memory accesses to/from the global memory of the GPU (Jang et al., 2011).

• Set S2: The S2 set of optimizations uses special data types such as float2, float4 and float8, provided by the OpenCL specification, for input data accesses. The input and output data variables constitute 64% of the program variables. These input and output variables were modified to use the specialized data types. The per-thread compute increases as the input data points increase due to the use of float2, float4 and float8 variables. A float and float2 combination with proper indexing was used as a technique to coalesce memory accesses in the S1 optimization set. The data transforms were applied to all the FFT implementations. This increases the effectiveness of coalesced memory accesses and improves spatial locality.

• Set S3: S3 optimizations modify the applications to use local memory to store input data points. Applications fetch data from global memory and store it in the local memory of the GPU. The FFT compute takes place over data placed in the local memory of the device. Stage overlapping is also implemented in the S3 set. Stage overlapping allows multi-radix FFTs to overlap stages of execution of different radix kernels. Stage overlapping is used along with the local memory optimizations.

Each set is evaluated in terms of execution performance and power consumption. Table 2 summarizes

Table 2. Optimization sets describing different optimization techniques for FFT applications.


Page 5: DOI: 10.1177/1094342014526907 design methods for ... · GPUs are now ubiquitous accelerator devices due to their impressive processing potential for a wide class of applications

the different optimization sets and the techniques used in each particular set.

4 Experimental setup and methodology

Next, we describe the platforms used for evaluation of power and execution performance. We also describe the methodology employed for power measurements on the different platforms used.

4.1 Platforms for evaluation

4.1.1 High power GPUs and APUs. We perform our evaluation of power and execution performance optimizations of FFTs on a wide range of heterogeneous devices. The evaluation with discrete GPUs uses an AMD Southern Islands family GPU (Radeon HD 7770) along with an NVIDIA Kepler family GPU (GTX 680). Experiments with shared memory using APUs include AMD's Fusion A8 processor and Intel's Ivy Bridge Core i7 processor. Fused architectures can be utilized to extract performance benefits from HPC applications such as the FFT (Spafford et al., 2012). Both of the APU platforms include a quad core x86 CPU and a GPU on the same die. The architectural details of each platform used for evaluation are provided in Table 3.

4.1.2 Low power SoCs. We evaluate the power and execution performance of different software design mechanisms on modern low power SoCs. The Qualcomm Snapdragon S4 Pro APQ8064 MDP (Mobile Development Tablet) is used for the evaluation. The Snapdragon S4 consists of a quad core ARM CPU (1.9 GHz) and a Qualcomm Adreno 320 GPU with OpenCL 1.2 compliance and a TDP of 5 W (BSquare, n.d.).

4.2 Measurement of power consumption

The setup used for measurement of power on discrete GPUs and shared memory APUs is shown in Figure 2. GPUs provide a number of thermal sensors for temperature regulation and to record the power utilization of the device. But access to such sensors is not exposed to the user for measuring power and energy consumption on the GPU. Hence, we present a methodology to explicitly measure power on GPUs. The discrete GPU device is powered using an external power supply (FSP Booster X5). The PCI Express slot of the GPU draws very small quantities of power (Collange et al., 2009). The major source of power to the GPU is the external power supply unit. A power meter is attached to the external power supply to record the readings. We utilize the methodology for power consumption measurement developed in our own previous work (Ukidave et al., 2013). Shared memory APUs have the graphics processor fused on the same die with the CPU. Thus, power measurements of only the GPU on an APU device are not possible due to its closely coupled structure. The power is measured across the APU chip using test points provided on the motherboard. The entire motherboard power was not measured for the APU power measurements. The power consumption measurements for the shared memory APUs are repeated many times to ensure consistency of results. The fluctuations recorded in power for repeated measurements are less than 3% on average for all the test platforms. Discrete GPUs are not present on the system during power measurements on an APU. Power measurements on the Qualcomm Snapdragon SoC tablets were done using direct power supply without

Table 3. Relevant architectural details of platforms used for evaluation (Nickolls and Dally, 2010; Boudier and Sellers, 2011; Mantor and Houston, 2011; Damaraju et al., 2012; NVIDIA, 2012).


Figure 2. Power measurement setup for (a) discrete GPU and (b) shared memory APU.



the battery for the tablet. The display and other peripherals on the tablet device were switched off.

5 Evaluation results

Next, we discuss the results of our power/performance evaluations on the different platforms described in Section 4.1.

5.1 Baseline results for power and executionperformance

We first evaluate the execution performance and power consumption of our FFT implementations without applying optimizations. Vendor-provided FFT implementations are used as baselines for our evaluation on each platform. clAmdFFT, an OpenCL based FFT implementation from AMD, is used as the baseline on the AMD Radeon 7770 GPU and AMD Fusion A8 APU (AMD, n.d.). The CUDA based cuFFT provided by Nvidia is used as the baseline for the GTX 680 GPU (Nvidia, 2010). Due to the lack of availability of a vendor-provided FFT implementation for the Intel Ivy Bridge APU and Qualcomm Snapdragon S4, FFTW is used as a baseline for evaluation (Frigo and Johnson, 2005). FFTW is a highly optimized FFT benchmark for multicore CPUs, and does not utilize the GPU on the device.

Figure 3 shows the execution performance of the FFT implementations on each platform. Execution performance is evaluated using three input sets of varying sizes (64K, 1M and 2M data points). Each data point is a single-precision complex number. The execution performance evaluation for the low power Snapdragon SoC is shown in Figure 5(a), below.

Using unoptimized versions of the FFTs results in mediocre performance. The Stockham FFT baseline produces poor performance due to the loop structure present in the application kernel. MR-SC FFT performs efficiently, within 25% of the baseline on the AMD Radeon 7770 and within 35% of the baseline on the Nvidia GTX 680. Effective use of local memory enhances the performance of MR-SC on GPU devices. The limited resources on the graphics processor of the APU contribute to lower performance, as compared to the discrete GPUs. But the performance trend of the FFT implementations on APUs is similar to those on GPUs. The graphics device on the Snapdragon SoC shows a 1.8× greater execution performance when compared to the Intel Ivy Bridge APU. The performance trend of the FFTs on the SoC is similar to the AMD Fusion APU.

Power consumption of an application on different devices can be attributed to the structure of the algorithm and the architecture of the device. The power/performance is calculated as given below:

Power-performance (GFLOPS/Watt) = (5 N log2 N) / (Time for execution (s) × Power (Watts))    (2)

where N is the number of input data points. The power/performance of the FFTs on discrete GPUs is shown in Figure 4.

Applications such as MR-SC, MR-MC and Apple FFT, which obtain good execution performance, yield better power/performance. The higher power consumption of the Nvidia GPU reduces the power efficiency of the applications. The average power efficiency of the FFTs on the Nvidia GPU is 2 GFlops/Watt versus 4.5 GFlops/Watt on the AMD GPU. Similar power/performance characteristics are seen on the APU and SoC devices. The lower compute capabilities of the APUs also produce a sharp decrease in the performance/power (GFlops/Watt) ratio. The low power consumption (1.1 W–3.9 W) of the Snapdragon S4 SoC is responsible for a superior performance/power ratio, as seen in Figure 5(b).

Figure 3. Execution performance of FFTs on all platforms. The platform specific baselines are cuFFT, clAmdFFT and FFTW.



5.2 Application level analysis of FFT implementations

FFT implementations can be designed using a single radix approach or a multiple radix approach. Implementations such as Stockham use a single radix to form the computation, whereas FFTs such as MR-MC use radix-2, radix-4 and radix-8 together to form the computation. Figure 6 shows the evaluation of the effects of different radix combinations on power consumption for each implementation. The power consumed is normalized to the highest power-consuming FFT implementation on that platform. Figures 6(a), (b) and (c) show the evaluation on APUs, discrete GPUs and the low power SoC, respectively.

FFT implementations with multiple radix combinations consume less power as compared to the single radix implementations. A single radix computation requires a large number of passes to complete the computation as compared with the multiple radix based computations. The single radix computation increases synchronization and global memory accesses for the GPU. The multiple radix combination reduces the number of passes required to complete the computation. This results in lower power consumption due to the reduction in memory accesses and synchronization requirements.

Multiple radix solutions require fewer stages to compute. This reduces the load on the compute resources and thus reduces the power consumption for the FFTs. This characteristic is observed on discrete GPUs, APUs and the SoCs. Our evaluation provides insight into the relationship between the power consumption of an FFT implementation and the radix combinations chosen.

Different radix combinations can be implemented for FFTs using single or multiple kernel calls. We measure the power consumption of the FFT implementations with respect to the number of kernel calls executed. FFTs such as Apple and MR-MC invoke multiple kernel calls for execution, as reported in Table 1. The evaluation of these FFTs on discrete GPUs is shown in Figure 7(a), and the same for shared-memory APUs is shown in Figure 7(b).

The overhead of the kernel launch affects the power consumption on both device types, as seen from Figure 7. To study the effect of kernel launch on power consumption, we measured the power of the single most compute-heavy kernel. From Figure 7, we see that the power consumption of this single compute-heavy kernel shows smaller variations across different devices. Multiple calls involve the use of different radix size

Figure 4. Power performance of FFTs (GFlops/Watt) on all platforms. The power-performance of the platform specific baselines (cuFFT, clAmdFFT and FFTW) is also shown.

Figure 5. (a) Execution performance of FFTs (GFlops) on the Qualcomm Snapdragon S4. (b) The power-performance of the Qualcomm Snapdragon S4 against the baseline (FFTW) for all FFT implementations.



kernels, which keep the command queue active and the scheduling unit busy for the entire computation. This affects the power consumption on the GPUs and APUs. The variation in power for the single compute-heavy kernel is observed to be 17% and 12% on discrete GPUs and APUs, respectively.
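As a rough illustration of this launch overhead (the per-launch cost below is an assumed placeholder, not a value measured in the study), the total host-side overhead grows linearly with the number of enqueued kernels, which is why multi-kernel FFTs keep the command queue and scheduler busier:

```python
def launch_overhead_us(n_kernel_calls, per_launch_us=10.0):
    """Total enqueue overhead; per_launch_us is a hypothetical per-launch cost."""
    return n_kernel_calls * per_launch_us

# A fused single-kernel FFT vs. a multi-kernel FFT with one call per pass:
print(launch_overhead_us(1))    # one launch
print(launch_overhead_us(23))   # e.g. one launch per radix-2 pass of an 8M FFT
```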

5.3 Effect of memory-based optimizations on power consumption

Optimizations in sets S1 and S2 change the global memory access pattern of the applications. S1 modifies the FFT kernels to perform coalesced memory accesses. The use of coalesced global memory access improves the throughput of an application, as shown in Figure 8(a). The improvement in throughput is one of the important factors that enhances the performance (GFlops) of FFTs, as shown in Figure 8(b). Coalesced accesses are handled using dedicated hardware in the data path of a GPU. This increases the power consumption of the device when performing coalesced accesses, due to higher utilization of those hardware units. Figure 8(b) shows the increase in power, accompanied by the improvement in performance, for both the AMD Radeon GPU and the AMD Fusion APU. The Qualcomm Snapdragon SoC shows a 16% increase in power consumption for the S1 optimizations, with a 23% increase in execution performance. The power increase on the SoC is not as large as observed on the APUs and the discrete GPUs, and is only 24% less than the documented TDP (5 W) of the device. Memory transactions produced by the S1 optimizations increase the power consumption by 32% on average, as seen in Figure 8(b).
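The benefit of coalescing can be sketched by counting the distinct memory transactions one wavefront generates (the 64-byte transaction segment and 64-lane wavefront below are illustrative assumptions, not parameters taken from the paper):

```python
def transactions(addresses, segment_bytes=64):
    """Count distinct transaction-sized segments touched by a set of accesses."""
    return len({addr // segment_bytes for addr in addresses})

wavefront = 64
elem = 4  # bytes per float

coalesced = [i * elem for i in range(wavefront)]       # unit-stride accesses
strided   = [i * 32 * elem for i in range(wavefront)]  # stride of 32 floats

print(transactions(coalesced))  # few segments: lanes share transactions
print(transactions(strided))    # one transaction per lane
```

Fewer transactions means higher delivered throughput per request, at the cost of keeping the coalescing hardware busier, which matches the power increase reported above.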

OpenCL implementations can leverage the float2, float4 and float8 data types to maintain data in contiguous chunks of memory. Accessing data using these data types can increase the number of coalesced memory accesses for a kernel. The FFT kernels are modified to access data using each of these data types. The performance and power consumption of these data transformations are shown in Figure 9. Performance increases as the number of coalesced memory accesses increases, as shown in Figure 9. The use of this data transform also increases the amount of work done by each work-item. Conversion from float to float8 increases the work of a thread by up to eight times due to the number of data items retrieved in a single access. In such cases, the synchronization overhead encountered after each stage of compute increases due to the increase in the number of data points processed per work-item. This results in increased utilization of scheduling resources on the GPU. Increased memory transactions also increase power consumption on the memory bus. The compute resources on the SoC are limited (to four parallel compute units) compared to the discrete GPUs, so coalesced memory access results in increased utilization of those resources, causing increased power consumption on SoCs. Thus, we observe an increase in power utilization for the float4 and float8 data transforms, as shown in Figure 9.
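A sketch of how the vector-type transform redistributes work (a hypothetical helper; the widths mirror the float/float2/float4/float8 variants discussed above): widening the vector type by K shrinks the NDRange by K while each work-item processes K points per access.

```python
def ndrange_and_work(n_points, vector_width):
    """Work-item count and points handled per work-item for a vector width."""
    assert n_points % vector_width == 0
    work_items = n_points // vector_width
    points_per_item = vector_width
    return work_items, points_per_item

N = 2 * 1024 * 1024  # 2M points, as used for Figure 9
for width in (1, 2, 4, 8):   # float, float2, float4, float8
    print(width, ndrange_and_work(N, width))
```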

5.4 Local memory and stage overlapping effects on power consumption

Optimizations defined in the S3 category modify the FFT kernels to utilize the local memory of the GPU. Use of local memory helps to hide global memory access latency. This improves the throughput of the system, resulting in effective improvements in performance. S3 optimizations also perform stage overlapping on kernels. The FFT implementations with multiple kernel calls,

Figure 6. Evaluation of the use of different radix combinations in FFTs on power consumption trends of the implementations. The power trends are normalized to the highest power-consuming FFT implementation on the respective platforms. The radix combinations used for each implementation are also provided. Evaluation is performed for block size N = 8 M for (a) APUs, (b) GPUs and (c) the Qualcomm SoC.

8 The International Journal of High Performance Computing Applications


such as Apple FFT and MR-MC FFT, are modified to use a stage overlap. All of the stages to be computed by a particular radix-n kernel are completed in a single kernel call. This reduces kernel call overhead.
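The effect of stage overlapping on launch counts can be sketched as follows (the radix plan shown is an assumed example, not one of the paper's measured plans): without overlap there is one launch per pass; with overlap, all passes served by the same radix-n kernel collapse into one launch.

```python
def calls_without_overlap(radix_plan):
    """One kernel launch per pass."""
    return len(radix_plan)

def calls_with_overlap(radix_plan):
    """One kernel launch per distinct radix kernel in the plan."""
    return len(set(radix_plan))

plan = [8] * 7 + [4]  # hypothetical plan for an 8M-point FFT
print(calls_without_overlap(plan))  # → 8
print(calls_with_overlap(plan))     # → 2 (one radix-8 call, one radix-4 call)
```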

Local memory is used for storing input data for each work-item when computing a stage of an FFT. The improvement in execution performance due to the S3 optimizations is seen in Figure 10(b). The average increase in performance on the AMD Radeon 7770 GPU employing the S3 optimizations is 57% over the baseline (S0) implementation. The effective increase in local memory accesses from S2 to S3 is shown in Figure 10(a) for the AMD Radeon 7770 GPU. The MR-SC FFT uses local memory in its unmodified

Figure 7. Evaluation of power trend for multiple kernel calls on (a) discrete GPU and (b) shared-memory APU. The number of kernel calls is plotted on the secondary axis.

Figure 8. (a) Effect of S1 optimizations on memory throughput, as compared to S0 optimizations, on discrete GPUs. (b) Power performance of S1 optimizations over S0 for FFTs.



implementation, as shown in Table 1. S2 optimizations include local memory accesses for supporting data variables required for compute, but do not consider the input data points. As observed in Figure 10, the power consumption of the FFTs does not vary much when we employ the S3 optimizations on the GPUs. However, the power consumption on the AMD Fusion APU increases due to local memory usage and stage overlapping. The Qualcomm Snapdragon SoC shows a small increase in power consumption, but provides more than a 30% increase in execution performance on average for S3 optimizations.

The size constraint of local memory in APUs requires changes in kernel structure to achieve high utilization. The number of data points handled by a particular work-item has to be reduced to prevent exhaustion of local memory. Due to the limited compute capabilities of the APU (as compared to discrete GPUs), the number of wavefronts formed increases. The increase in the number of wavefronts adds additional load to the wavefront scheduling unit. Thus, the graphics processor on the APU is oversubscribed, which increases the power consumption across all the architectural units, including the scheduling unit, shader pipeline and memory bus. An effective rise in power consumption of 23% on average is therefore seen for APUs due to local memory usage. The SoC faces a similar power increase due to the oversubscription of GPU resources, but the increase in power consumption on the SoC is not as drastic as on the APU.
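The wavefront pressure described above can be sketched numerically (assuming AMD's 64-work-item wavefront; the tile sizes are illustrative): shrinking the per-work-item tile to fit local memory inflates the NDRange and hence the wavefront count the scheduler must manage.

```python
import math

WAVEFRONT = 64  # work-items per AMD wavefront

def wavefronts(n_points, points_per_item):
    """Wavefronts needed when each work-item handles points_per_item points."""
    work_items = math.ceil(n_points / points_per_item)
    return math.ceil(work_items / WAVEFRONT)

N = 2 * 1024 * 1024
print(wavefronts(N, 8))  # larger tile, fewer wavefronts
print(wavefronts(N, 2))  # smaller tile forced by limited local memory: 4x more
```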


Figure 9. Effect of the S2 optimization-based data transformations on execution performance and power consumption for the Nvidia GTX 680 GPU, input size = 2 M.


Figure 10. (a) Memory access distribution pattern for the AMD Radeon 7770 GPU over S2 and S3 optimization sets. (b) Effect of the S3 optimizations as compared to S0 on execution performance and power consumption, input size = 2 M.



5.5 Power consumption of different architectural factors

We also analyze the relationship between different architectural factors of the GPU and power consumption. The evaluation considers only the S3 optimizations, which provide the largest gains versus the S1 and S2 sets of optimizations. We analyzed the memory-unit stalls and ALU utilization for the AMD Radeon GPU and the AMD Fusion APU. The evaluation is shown in Figure 11. We observe a reduction in power consumption for applications with a lower memory-unit stall rate. This can be attributed to the reduction in the number of in-flight memory accesses and memory dependencies, which differ across the FFT implementations. Thus, as the number of memory-unit stalls is reduced, the power consumption of the application on the GPU is reduced. Lower memory stalls reflect a reduced number of external memory accesses; each memory access costs power, which is effectively lowered with fewer stalls. Similar power trends are seen on the AMD APU, but the impact on power is lower due to the higher idle-power consumption of APUs. Stage overlapping is most effective when performed with effective local memory utilization. We observe a higher ALU utilization for both GPUs and APUs. The MR-MC and Apple FFTs are multiple kernel-call implementations, which show maximum ALU utilization with stage overlapping. As seen from Figure 11, ALU utilization does not overly impact power consumption, but a larger ALU utilization improves the power/performance (GFlops/Watt) of an application. The analysis provided for architectural features cannot be performed for the Qualcomm Snapdragon SoC due to the lack of compute-level profilers from the vendor.

5.6 Energy consumption due to optimization techniques

We analyze the energy consumption of FFTs for all the implemented optimization techniques. Studying energy consumption provides a unified metric to understand both the power consumption and the execution performance of an application. The energy consumption of FFTs across all optimization sets is shown in Figure 12. The variation in energy consumption across our optimizations S1, S2 and S3 is 11% on average. The increase in power consumption is the cost of improved performance; energy consumption analysis captures this trade-off. An increase in energy consumption of 13% on average is seen for S2 optimizations, as compared against the S1 and S3 optimizations. The low-power SoC provides an energy-efficient solution for high performance computing, as seen in Figure 12. The energy consumption of the Qualcomm Snapdragon is 65% less than that of discrete GPUs, accompanied by a 63% slowdown in execution

Figure 11. Power consumption trend due to different architectural factors for the AMD Radeon 7770 GPU using S3 optimizations, input size = 2 M.

Figure 12. Energy consumption (in Joules) of FFTs across different optimization sets for the AMD Radeon 7770 GPU and AMD Fusion APU, input size = 2 M.



performance. This power consumption versus execution performance trade-off is presented in Figure 12, and can help developers choose the most appropriate optimization depending on the algorithmic structure of the application.
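The unified metric is simply energy = average power × execution time. The numbers in the sketch below are invented to show the shape of the trade-off, not measurements from Figure 12:

```python
def energy_joules(avg_power_w, time_s):
    """Energy in Joules from average power (W) and execution time (s)."""
    return avg_power_w * time_s

def gflops_per_watt(gflops, avg_power_w):
    """Power/performance efficiency metric used throughout the study."""
    return gflops / avg_power_w

# A high-power discrete GPU finishing fast vs. a low-power SoC running longer:
gpu_energy = energy_joules(120.0, 0.5)  # hypothetical GPU: 120 W for 0.5 s
soc_energy = energy_joules(4.0, 6.0)    # hypothetical SoC: 4 W for 6 s
print(gpu_energy, soc_energy)           # the SoC can win on energy despite the slowdown
```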

6 Results summary

The discussion in the previous sections highlights the power consumption trends of the FFTs with respect to the radix combinations used in their designs. We find that the implementations using multiple radix combinations consumed less power. To understand the relationship between OpenCL implementations and power consumption, we conducted further studies to observe the power overhead imposed by additional kernel calls. As expected, this showed that the number of kernel calls affects the power consumed. FFTs using a single kernel call and multiple radix combinations are judged the most power/performance efficient in our evaluation.

We observed the power/performance trade-off for the selected set of optimizations. Developers can choose a set of optimizations to achieve better power efficiency, compute efficiency or both. As seen in Section 5.4, optimization techniques such as stage overlapping and local memory usage can benefit the power/performance of an application executing on GPUs. Coalesced memory access and loop unrolling help improve execution performance, but can consume more power on GPUs, as observed in Section 5.3. Data transformations provide better performance, but can result in increased synchronization requirements, which can lead to increased power consumption. Exclusive use of the S1 and S2 optimizations can yield an improvement in performance at the cost of an increase in energy consumption. Hence, data transformations must be carefully implemented to obtain optimal power/performance.

As observed in Sections 5.3 and 5.4, the impact of the selected optimization techniques such as coalesced memory accesses, loop unrolling, data transformations and software pipelining is low for APUs. Applications oversubscribe the limited compute resources of the APU and increase the power consumption of different units in the APU. Local memory usage is the most effective power/performance optimization on APUs. Developers should carefully choose optimizations for efficient execution on APUs. The low-power Qualcomm Snapdragon SoC exhibits power-efficient behavior for the S3 optimizations. The coalesced memory accesses of the S1 optimizations cause an increase in the power consumption of the SoC. As observed in Section 5.6, the Qualcomm Snapdragon shows highly energy-efficient behavior, but with low execution performance.

The energy consumption of different optimizations is evaluated in Section 5.6. The selected optimizations possess similar energy characteristics, but the results vary across different FFTs. Table 4 summarizes the power efficiency and performance efficiency observed in this study. High efficiency indicates that an optimization technique provides a 40% or greater improvement on average over the baseline platform. Moderately efficient optimizations indicate a 10–40% improvement on average. Optimizations which provide less than a 10% improvement on average for power or execution performance are categorized as less efficient. The optimizations described in this study are generic and can be applied to any application developed for GPUs. The FFTs used in this study are representative of the class of data-parallel applications designed for execution on accelerator devices. Hence, the results obtained from this study can be generalized to other data-parallel applications executing on GPUs and APUs. Developers can utilize combinations of the different optimizations (S1, S2 and S3) to best suit the requirements of their application.
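The categorization used in Table 4 can be expressed as a small classifier over the average percentage improvement (thresholds taken from the text above; the function name is ours):

```python
def efficiency_class(improvement_pct):
    """Map an average % improvement over baseline to a Table 4 category."""
    if improvement_pct >= 40:
        return "high"
    if improvement_pct >= 10:
        return "moderate"
    return "low"

print(efficiency_class(57))  # → high (e.g. the average S3 speedup on the Radeon 7770)
print(efficiency_class(16))  # → moderate
print(efficiency_class(2))   # → low
```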

Table 4. Summary of optimization techniques categorized according to their power and performance efficiency. Efficiency is quantified by the percentage improvement in each category.




7 Related work

The importance of performance for heterogeneous applications has led to a number of previous high-quality studies. Ryoo et al. (2008) and Jang et al. (2011) investigate a range of optimizations such as loop unrolling, utilization of ALUs, fetch bandwidth, and thread usage. Their work also provides an in-depth explanation of different data transformation techniques and characterizes access patterns in kernels. Williams et al. (2009) present a visual computational model which characterizes the execution domain of an application as memory-bound or compute-bound. That work provides a deeper understanding of application kernel behavior in heterogeneous development environments.

Power consumption on GPUs has been studied using both experimental methods, where power is measured on running systems, and power models. Power models for GPUs (Hong and Kim, 2009; Nagasaka et al., 2010; Kasichayanula et al., 2012) allow an architect to explore the design space while remaining aware of energy consumption. Power models are commonly validated against experiments using microbenchmarks (Suda and Ren, 2009; Kasichayanula et al., 2012). Power modeling research is complementary to our work, since our goal is to evaluate power/performance for devices, such as the Ivy Bridge APU, for which power models are unavailable.

The importance of the FFT algorithm in DSP and HPC applications has resulted in many previous studies focused on performance optimization of the underlying algorithm. FFTW (Frigo and Johnson, 2005) is a highly optimized FFT implementation for CPUs. Performance optimization of FFTs for GPU platforms includes studies by Govindaraju et al. (2008), Nukada et al. (2008) and Volkov and Kazian (2008). These studies report substantial performance improvements, but are restricted to Nvidia platforms since they predate OpenCL and heterogeneous devices. The popularity of GPUs in the HPC space has also led to the inclusion of the FFT in benchmark suites such as SHOC (Danalis et al., 2010).

GPUs from Nvidia provide specialized hardware and driver support for power measurement in the Nvidia Management Library (NVML) (NVIDIA, n.d.). This methodology provides finer-grained measurements (Kasichayanula et al., 2012). However, due to the large number of platforms from multiple vendors used in this paper, we decided to utilize consistent and vendor-neutral interfaces to measure power.

8 Conclusion

In this paper, we studied different optimization techniques for heterogeneous applications, and their effects on power and execution performance. We studied these effects for different implementations of an FFT on a number of discrete and fused platforms. We studied optimizations including coalesced memory accesses, loop unrolling, local memory usage, data transformations and stage overlapping.

We investigated the effects of design attributes, such as the radix combinations used in an FFT implementation, on power consumption. We find a 35% decrease in power on APUs and up to a 55% decrease in power on discrete GPUs for FFTs that employ multiple radix combinations. The choice of a particular FFT algorithm can thus affect the power consumption for large input sizes. We also observed a strong relationship between power consumption and the number of kernel calls for the FFT implementations.

Our results varied by 55.3% in terms of power/performance for the selected FFT algorithms run on discrete GPUs, and by up to 84% when run on APUs. Execution performance speedups approaching 1.8× were obtained for each class of optimization studied. We observed an increase in power for the S1 and S2 optimizations of up to 28% on AMD GPUs, 11% on AMD APUs and 16% on the Qualcomm Snapdragon SoC. The S3 level of optimizations was the most power/compute efficient, yielding more than a 2× performance speedup with less than a 2% increase in power consumption on discrete GPUs. The S3 optimizations consume 10.5% less energy than the S1 and S2 optimizations. The algorithmic structures of the Stockham FFT and MR-SC FFT consume 20% more energy compared to the other FFT implementations across all optimization sets.

We have also explored the architectural factors affecting power consumption on GPUs. This preliminary evaluation shows a strong relationship between power consumption and memory-unit stalls on GPUs. We also found that ALU utilization did not seem to directly affect the power consumption of the device.

The energy efficiency of the Qualcomm Snapdragon SoC can be leveraged in scientific computing clusters. A cluster with several SoC nodes can provide greater compute power alongside CPUs and GPUs. The OpenCL compliance of the SoC device can be harnessed for speedups on heterogeneous HPC applications. Clusters built from SoCs can operate as low-power nodes when used alongside clusters consisting of discrete GPUs.

We believe that our work can help researchers, programmers and developers identify the potential of different kernel choices from a power-aware perspective. Our analysis of optimization techniques can easily be extended to study any data-parallel algorithm designed for heterogeneous devices. Other fundamental algorithms, such as sorting, where we have a choice of different algorithms and sorting networks, can be evaluated for power/performance. This study helps with reasoning about the power trade-offs



associated with different optimizations and choosing the most appropriate one based on application requirements. Our evaluations help developers choose a correct combination of devices, such as GPUs, APUs and SoCs, for an energy-efficient cluster design.

9 Future work

Our present research on the power consumption of heterogeneous applications and devices opens up many avenues for further exploration. We plan to extend our work by analyzing power consumption factors across multiple OpenCL devices at the cluster level. We also want to study the microarchitectural features of heterogeneous devices which most affect power consumption. We plan to pursue this direction using GPU simulators such as GPGPU-Sim (Bakhoda et al., 2009) and Multi2Sim (Ubal et al., 2012).

Acknowledgment

We would like to thank Chris Klob from Qualcomm for providing the Snapdragon test platforms.

Funding

This work was supported by Analog Devices Inc., AMD, Nvidia and Qualcomm, and by a National Science Foundation (NSF) ERC Innovation Award (grant number EEC-0946463) and an NSF CNS Award.

References

AMD (n.d.) AMD SDK (formerly ATI Stream). Available at: http://developer.amd.com/gpu/AMDAPPSDK/.
AMD (n.d.) clAmdFft, OpenCL FFT library from AMD. Available at: http://www.bealto.com/gpu-fft.html.
Apple (n.d.) Apple implementation of FFT using OpenCL. Available at: https://developer.apple.com/library/mac/samplecode/OpenCL_FFT/Introduction/Intro.html
Bakhoda A, Yuan GL, Fung WWL, et al. (2009) Analyzing CUDA workloads using a detailed GPU simulator. In: Proceedings of 2009 IEEE international symposium on performance analysis of systems and software, Boston, MA, 26–28 April 2009, pp.163–174.
Boudier P and Sellers G (2011) Memory system on Fusion APUs. In: AMD Fusion Developer Summit, Bellevue, Washington, USA, 13–16 June 2011.
BSquare (n.d.) Qualcomm Snapdragon S4 Pro APQ8064 MDP tablet datasheet. Available at: http://www.bsquare.com/Documents/APQ8064%20MDP%20Tablet.pdf
Collange S, Defour D and Tisserand A (2009) Power consumption of GPUs from a software perspective. In: Computational Science—ICCS 2009 (Lecture Notes in Computer Science, Vol. 5544). Berlin; Heidelberg, Germany: Springer, pp.914–923.
Damaraju S, George V, Jahagirdar S, et al. (2012) A 22 nm IA multi-CPU and GPU system-on-chip. In: Proceedings of 2012 IEEE international solid-state circuits conference, San Francisco, CA, 19–23 February 2012, pp.56–57.
Danalis A, Marin G, McCurdy C, et al. (2010) The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite. Oak Ridge, TN: Oak Ridge National Laboratory. Distributed by the Office of Scientific and Technical Information, U.S. Dept. of Energy.
Deschizeaux B and Blanc J (2007) Imaging earth's subsurface using CUDA. GPU Gems 3: 831–850.
Duhamel P and Vetterli M (1990) Fast Fourier transforms: A tutorial review and a state of the art. Signal Processing 19(4): 259–299. doi: 10.1016/0165-1684(90)90158-U.
Frigo M and Johnson S (2005) The design and implementation of FFTW3. Proceedings of the IEEE 93(2): 216–231. doi: 10.1109/JPROC.2004.840301.
Gaster B, Howes L, Kaeli DR, et al. (2011) Heterogeneous Computing with OpenCL. Waltham, MA: Morgan Kaufmann.
Govindaraju N, Lloyd B, Dotsenko Y, et al. (2008) High performance discrete Fourier transforms on graphics processors. In: Proceedings of SC 2008—International conference for high performance computing, networking, storage and analysis, Austin, TX, 15–21 November 2008, pp.1–12.
Hassanieh H, Adib F, Katabi D, et al. (2012) Faster GPS via the sparse Fourier transform. In: Proceedings of the 18th annual international conference on mobile computing and networking—MOBICOM '12, Istanbul, Turkey, pp.353–364.
Hong S and Kim H (2009) An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. ACM SIGARCH Computer Architecture News 37(3): 152. doi: 10.1145/1555815.1555775.
Jang B, Do S, Pien H, et al. (2009) Architecture-aware optimization targeting multithreaded stream computing. In: Proceedings of 2nd workshop on general purpose processing on graphics processing units—GPGPU-2, Washington DC, USA, 8 March 2009, pp.62–70. doi: 10.1145/1513895.1513903.
Jang B, Schaa D, Mistry P, et al. (2011) Exploiting memory access patterns to improve memory performance in data-parallel architectures. IEEE Transactions on Parallel and Distributed Systems 22(1): 105–118. doi: 10.1109/TPDS.2010.107.
Kasichayanula K, Terpstra D, Luszczek P, et al. (2012) Power aware computing on GPUs. In: Proceedings of symposium of application accelerators in high performance computing—SAAHPC '12, Lemont, Illinois, USA, July 2012, pp.63–74.
Liu Y, Zhang E and Shen X (2009) A cross-input adaptive framework for GPU program optimizations. In: Proceedings of IEEE international symposium on parallel and distributed processing (IPDPS 2009), Rome, Italy, May 2009, pp.1–10.
Mantor M and Houston M (2011) AMD graphic core next, low power high performance graphics and parallel compute. In: AMD Fusion Developer Summit, Bellevue, Washington, USA, 13–16 June 2011.
Nagasaka H, Maruyama N, Nukada A, et al. (2010) Statistical power modeling of GPU kernels using performance counters. In: Proceedings of IEEE international conference on green computing, Chicago, IL, 15–18 August 2010, pp.115–122.
Nickolls J and Dally WJ (2010) The GPU computing era. IEEE Micro 30(2): 56–69. doi: 10.1109/MM.2010.41.
Nukada A, Ogata Y, Endo T, et al. (2008) Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In: Proceedings of SC 2008—International conference for high performance computing, networking, storage and analysis, Austin, TX, 15–21 November 2008, pp.1–11.
Nvidia (2010) cuFFT library. Available at: https://developer.nvidia.com/cufft.
Nvidia (2012) Whitepaper on NVIDIA GeForce GTX 680. Technical report.
Nvidia (n.d.) Nvidia Management Library (NVML). Available at: http://developer.nvidia.com/cuda/nvidia-management-library-nvml.
Owens J, Houston M, Luebke D, et al. (2008) GPU computing. Proceedings of the IEEE 96(5): 879–899.
Pharr M and Fernando R (2005) GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Upper Saddle River, NJ: Addison-Wesley.
Rivera D, Schaa D, Moffie M, et al. (2007) Exploring novel parallelization technologies for 3-D imaging applications. In: Proceedings of 19th international symposium on computer architecture and high performance computing (SBAC-PAD'07), Gramado, Brazil, October 2007, pp.26–33.
Rofouei M, Stathopoulos T, Ryffel S, et al. (2008) Energy-aware high performance computing with graphic processing units. In: Proceedings of the 2008 conference on power aware computing and systems, San Diego, California, USA, December 2008. Available at: https://www.usenix.org/conference/hotpower-08/energy-aware-high-performance-computing-graphic-processing-units.
Ryoo S, Rodrigues CI, Baghsorkhi SS, et al. (2008) Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: Proceedings of the 13th ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP '08), Salt Lake City, Utah, USA, February 2008, pp.73–82.
Spafford KL, Meredith JS, Lee S, et al. (2012) The tradeoffs of fused memory hierarchies in heterogeneous computing architectures. In: Proceedings of the 9th conference on computing frontiers (CF '12), New York, May 2012, pp.103–112.
Stone S, Yi H, Haldar J, et al. (2008) How GPUs can improve the quality of magnetic resonance imaging. Urbana 51: 61801.
Suda R and Ren DQ (2009) Accurate measurements and precise modeling of power dissipation of CUDA kernels toward power optimized high performance CPU-GPU computing. In: 2009 international conference on parallel and distributed computing, applications and technologies, Hiroshima, Japan, December 2009, pp.432–438.
Tsoi KH and Luk W (2011) Power profiling and optimization for heterogeneous multi-core systems. ACM SIGARCH Computer Architecture News 39(4): 8. doi: 10.1145/2082156.2082159.
Ubal R, Mistry P, Schaa D, et al. (2012) Multi2Sim: A simulation framework for CPU-GPU computing. In: Proceedings of the 21st international conference on parallel architectures and compilation techniques, Minneapolis, USA, September 2012, pp.335–344.
Ukidave Y, Ziabari A, Mistry P, et al. (2013) Quantifying energy efficiency of FFT on heterogeneous platforms. In: Proceedings of international symposium on performance analysis of systems and software (ISPASS 2013), Austin, Texas, USA, April 2013, pp.235–244.
Volkov V and Kazian B (2008) Fitting FFT onto the G80 architecture. Available at: http://www.cs.berkeley.edu/kubitron/courses/cs258-S08/projects/reports/project6_report.pdf.
Williams S, Waterman A and Patterson D (2009) Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM 52(4): 65–76.
Ye W, Vijaykrishnan N, Kandemir M, et al. (2000) The design and use of SimplePower. In: Proceedings of the 37th conference on design automation (DAC '00), New York, June 2000, pp.340–345.

Author biographies

Yash Ukidave is a PhD candidate at Northeastern University and a member of the Northeastern University Computer Architecture Research Laboratory (NUCAR). He received his MS in Computer Engineering from Northeastern University, Boston, USA, and his BS in Electronics Engineering from the University of Mumbai, India. Yash has been working on the design of low-power clusters for high performance computing applications using OpenCL and CUDA. His current research focuses on power optimization for GPU architectures and the design of efficient runtime scheduling and management schemes for heterogeneous platforms.

Amir Kavyan Ziabari received his MS in Computer Engineering from the University of Isfahan, Isfahan, Iran, and his BS in Computer Hardware from AmirKabir University of Technology, Tehran, Iran. He is currently a PhD candidate studying Computer Architecture at Northeastern University, Boston, USA, where he is a member of the Northeastern University Computer Architecture Research Laboratory (NUCAR). His current research focuses on GPU architecture and design, and interconnection networks in heterogeneous architectures.

Perhaad Mistry works in AMD's developer tools group at the Boston Design Center, focusing on debugging tools for heterogeneous architectures. He is also a PhD candidate at Northeastern University and a member of the Northeastern University Computer Architecture Research Laboratory (NUCAR). He received a BS in Electronics Engineering from the University of Mumbai, India, and an MS in Computer Engineering from Northeastern University in Boston, USA. Perhaad has been working on GPU architecture since CUDA 0.8. He has implemented medical imaging algorithms for GPGPU platforms and has also worked on implementing architecture-aware data structures for the physics calculations in surgical simulators. His present research focuses on the design of profiling tools for heterogeneous computing. He is presently based in Boston.



Gunar Schirner holds PhD (2008) and MS (2005) degrees in electrical and computer engineering from the University of California, Irvine. He is currently an Assistant Professor in Electrical and Computer Engineering at Northeastern University. Prior to joining the Northeastern faculty, he was an assistant project scientist at the Center for Embedded Computer Systems (CECS) at the University of California, Irvine. Professor Schirner also has five years of industry experience at Alcatel (now Alcatel-Lucent), where he designed distributed embedded real-time software for telecommunication products. His research interests include embedded system modeling, system-level design, and the synthesis of embedded software.

David Kaeli received a BS and PhD in Electrical Engineering from Rutgers University, and an MS in Computer Engineering from Syracuse University. He is the Associate Dean of Undergraduate Programs in the College of Engineering and a Full Professor on the ECE faculty at Northeastern University, Boston, MA. He is the director of the Northeastern University Computer Architecture Research Laboratory (NUCAR). Prior to joining Northeastern in 1993, Kaeli spent 12 years at IBM, the last 7 at the TJ Watson Research Center, Yorktown Heights, USA. Dr Kaeli has published over 200 critically reviewed publications, 7 books and 8 patents. His research spans a range of areas, from microarchitecture to back-end compilers and database systems. His current research topics include information assurance, graphics processors, virtualization, heterogeneous computing and multi-layer reliability. He serves as the Chair of the IEEE Technical Committee on Computer Architecture.
