
IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING, VOL. , NO., MONTH YEAR 1

Energy Efficient Mapping of LTE-A PHY Signal Processing Tasks on Microservers

Anup Das, Member, IEEE, Francky Catthoor, Fellow, IEEE, André Bourdoux, Member, IEEE, and Bert Gyselinckx, Member, IEEE

Abstract—Centralized radio access network (C-RAN) is a network architecture that is emerging as a key technology enabler for 5G mobile networks as capacity demands for mobile traffic continue to proliferate. Essentially, C-RAN involves separating the remote radio heads (RRHs) from the baseband units (BBUs), which are processed in the cloud. A systematic design of C-RAN involves mapping individual baseband signal processing tasks to general-purpose cloud infrastructures, such as the microserver, to reduce the energy footprint. In this paper we start with mapping the lowest protocol stack, i.e., the physical layer (PHY), which is characterized by strict latency and dynamic data rate. To achieve this, we explore the use of machine intelligence for energy-efficient mapping of PHY signal processing on microservers. Fundamental to this approach is (1) the use of principal component analysis to represent workload from multi-dimensional hardware performance statistics, demonstrating 99.88% correlation with the critical PHY processing latency; and (2) the use of deep learning to model latency and predict dynamic workload for on-demand resource allocation, resulting in up to 36% reduction in hardware usage. These principles are built into a cross-layer run-time framework, which adapts resource allocation in response to time-varying data rate, guaranteeing latency and improving energy efficiency by up to 48% (average 28%).

Index Terms—Centralized radio access network (C-RAN), LTE-Advanced, Physical layer (PHY), Neural network, Deep learning, Principal component analysis (PCA).

I. INTRODUCTION

TECHNOLOGICAL advances in smartphones, together with the growing popularity of social applications, have resulted in a proliferation of mobile data traffic over the past few decades [1], [2]. Centralized radio access network (RAN), Cloud-RAN or C-RAN [3], is a promising approach to support the enormous data traffic volume by providing an efficient network infrastructure, lower energy consumption, agile traffic management and high reliability [4]. The basic concept behind C-RAN is as follows: a base station in a conventional network architecture is split into two parts: remote radio heads (RRHs), which are distributed across sites, and baseband units (BBUs), which are clustered together to be processed centrally in the cloud.

Several recent efforts have sought to deploy C-RAN on commodity hardware [5]–[15]. Despite these advances, C-RAN mapping on cloud infrastructures still remains

A. Das is with the Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104 USA (e-mail: [email protected]).

F. Catthoor, A. Bourdoux and B. Gyselinckx are with IMEC, Kapeldreef 75, 3001 Leuven, Belgium.

Manuscript received June 25, 2017; revised October 12, 2017, December 20, 2017; accepted January 7, 2018.

a challenge, both in terms of a high energy footprint [16] and in terms of service availability, leading to data rate variability and occasional call drops [17]. Some of the existing research on energy-efficient, predictable C-RAN mapping is ad hoc and/or specific to a given baseband architecture. Additionally, the following gaps still exist, which we address in this work. First, most studies focus on application optimization for a given baseband architecture, without considering hardware performance capabilities. Second, workload often refers to RAN parameters such as physical resource blocks (PRBs) or dynamic data rate. Execution time or hardware utilization, used as proxies to represent workload (as in [10]–[13]), are not accurate, as we demonstrate in this work.

We propose a machine learning-based adaptive framework for energy-efficient mapping of physical layer (PHY) signal processing tasks on microserver hardware. This framework adapts resource allocation at run-time in response to the time-varying data rate, while guaranteeing the latency specified in the Long Term Evolution-Advanced (LTE-A) standard [18]. Our approach interacts with the application and the hardware to map PHY signal processing on the microserver, improving energy efficiency. The framework brings together principal component analysis (PCA) [19] for workload representation, echo state networks [20] for workload prediction and deep feedforward neural networks [21] for PHY latency modeling. Experiments with open-source PHY software [22] on an Intel microserver [23] demonstrate that the proposed framework reduces resource requirements by up to 36%, improving energy efficiency by an average of 28% over existing practices.

Our key contributions are as follows.

• a machine learning framework to map PHY signal processing on microservers;

• dimensionality reduction using principal component analysis to represent hardware workload;

• hardware workload prediction using a deep neural network to adapt hardware resources in response to time-varying data rate; and

• application latency modeling using deep learning to improve energy efficiency.

The remainder of this paper is organized as follows. A literature survey is presented in Section II, followed by motivation in Section III. An introduction to C-RAN is provided in Section IV. Our machine learning-based run-time mapping framework is discussed in Section V. Results and discussions are presented in Section VI and conclusions in Section VII.


TABLE I: Summary of related works.

Techniques | Platform    | Workload        | Allocation   | Learning
[5]–[9]    | Dedicated   | RAN param.      | Conservative | ×
[10], [11] | Single-core | Latency         | Conservative | ×
[24]       | Multi-core  | RAN param.      | Aggressive   | ×
[25]       | Simulation  | RAN param.      | Reactive     | ✓
[12]–[14]  | Single-core | CPU utilization | Reactive     | ×
Proposed   | Multi-core  | Hardware        | Proactive    | ✓

II. RELATED WORKS

Mapping of software-defined radio onto heterogeneous architectures with co-processors is proposed in [5], [6]. A fully digital baseband processing hardware is proposed in [8]. Other approaches involve dedicated baseband processing using novel hardware [7], [9]. In contrast to these approaches, we use general-purpose hardware. In [10], [11], mapping of an LTE eNodeB on a single-core system is proposed. In [11], workload analysis is based on the PHY application processing latency. An elastic resource allocation framework is proposed in [24]. A deep reinforcement learning framework is studied in [25]. These works suffer from the limitations discussed in Section I. In [26], computing resources are minimized by leveraging radio access units' sleep scheduling and virtual machine consolidation. A radio-head cooperation technique is proposed in [27] to minimize the energy consumption of C-RAN. These approaches are orthogonal to our work.

The approaches closest to ours are [12]–[14]. In [14], a prototype is implemented for mapping LTE medium access control (MAC) signal processing on an Intel-based single server. Compared to this, our approach starts with the PHY, which has stricter latency requirements and a dynamic data rate; we also address the limitations discussed in Section I. In [12], LTE-A PHY signal processing tasks are mapped on a single core. CPU utilization is used to determine the mapping, which is less accurate. Our approach uses hardware statistics to represent workload. A hardware performance-driven approach is proposed in [13] to map LTE subframes on hardware without dynamic workload characterization. Our work differs by characterizing workload at run-time using machine intelligence, which reduces hardware usage and increases energy efficiency. Table I summarizes the related works and highlights the contribution of this paper.

III. MOTIVATIONAL EXAMPLES

Observation 1: CPU utilization and PHY processing latency are not accurate in representing the true server workload. To demonstrate this, we conducted an experiment in which the power consumption of the CPU cores is recorded at 1-minute intervals while executing PHY subframes for a given workload. The CPU utilization and the processing latency are also recorded at the same intervals. At the end of every hour, we compute the cross-correlation between the power consumption and the CPU utilization/processing latency. Table II reports the statistics over a 24-hour period on a typical work day. Correlation coefficients are computed as follows.

TABLE II: Correlation between workload and power.

Correlations with power consumption:

Workload               | Minimum | Maximum | Average
CPU utilization [13]   | 0.49    | 0.98    | 0.76
PHY latency [11], [25] | 0.23    | 0.93    | 0.81
Proposed               | 0.95    | 1.00    | 0.98

TABLE III: Analysis results using Bhaumik et al. [13].

Time            | Estimated workload | Actual workload | Allocated resource | Required resource | Latency
3:00 – 4:00 AM  | 12.5               | 8               | 80                 | 48                | 8 ms
6:00 – 7:00 AM  | 12.5               | 25              | 80                 | 150               | 12 ms
9:00 – 10:00 AM | 33.3               | 33.3            | 200                | 200               | 10 ms

Let x, y ∈ R^{1×n} be the two measures with n observations. The correlation coefficient is

r = [ Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) ] / [ √(Σ_{i=1}^{n} (x_i − x̄)²) · √(Σ_{i=1}^{n} (y_i − ȳ)²) ]    (1)

where x̄ = (1/n) Σ_{i=1}^{n} x_i, and analogously for ȳ. The closer the coefficient is to 1, the higher the correlation between x and y.

As can be seen from the table, the correlation coefficient

varies widely across the day when CPU utilization or PHY latency is used as the workload measure. The average correlation coefficients are also low. This is because CPU utilization does not capture all hardware details specific to a workload. For example, in some cases background jobs cause the CPU utilization to be higher, resulting in workload overestimation. Our proposed workload measure incorporates all these details, with a high average correlation of 0.98.

Observation 2: Inaccurate workload measures can impact quality-of-service. Table III demonstrates this using [13]. The workload values are reported in arbitrary units. The resource count is measured in the number of CPU cores deployed for processing. Three time slots are shown for demonstration. Between 3:00 and 4:00 AM, the estimated workload is an overestimate of the actual workload. As a result, more resources are allocated than needed (80 as opposed to the required 48). The latency obtained is therefore lower than the required 10 ms (refer to Section IV). On the other hand, between 6:00 and 7:00 AM, the estimated workload is an underestimate, leading to deployment of fewer resources than needed (80 as opposed to the required 150). In this case the latency can go up to 12 ms. This impacts the data rate offered/guaranteed to the user. Another implication is that subframes can occasionally be dropped due to resource overload. Finally, between 9:00 and 10:00 AM, the estimated workload equals the actual workload, leading to allocation of the required resources and a 10 ms latency.
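For concreteness, the coefficient in Eq. (1) can be computed directly from the two recorded series; the following is a minimal sketch, with sample values that are purely illustrative rather than measured data:

```python
import numpy as np

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length series, Eq. (1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.sqrt((xc ** 2).sum()) * np.sqrt((yc ** 2).sum())))

# Illustrative 1-minute samples of power (W) and CPU utilization
power = [10.1, 12.3, 11.0, 14.8, 13.2]
util = [0.41, 0.55, 0.47, 0.69, 0.58]
r = correlation(power, util)  # close to 1 for strongly correlated series
```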

IV. C-RAN OVERVIEW AND PHY MAPPING CHALLENGES

In a general C-RAN implementation, the RRHs are distributed across multiple sites, while the BBUs are clustered together and processed centrally on a cloud server. Figure 1 shows an overview of the implementation. Baseband signal processing tasks in the cloud belong to three protocol


Fig. 1: Mapping PHY processing on the cloud (base station pool; real-time centralized processing in the cloud; distributed remote radio heads (RRHs); high-bandwidth optical transport network).

Fig. 2: LTE-A frame format (frames of 10 subframes between eNB and UE; each subframe is 1 ms).

stacks [18]: (1) the physical layer (PHY), (2) the media access control layer (MAC) and (3) the radio resource control layer (RRC). We start with the PHY, which is the most resource-critical layer, with two key performance constraints: strict latency and dynamic data rate. Our continuing work will address the higher protocol layers. Pertaining to the PHY of a cellular network, uplink refers to data flow from the user equipment (UE), e.g., cell phones, to the base station, called eNodeB (eNB). Downlink refers to data flow from the eNB to the UE. We demonstrate our approach for the downlink direction; it is equally applicable to the uplink direction. We enumerate the requirements for PHY processing below.

A. Round-trip Latency

Figure 2 shows the LTE-A frame format [18]. Each frame (10 ms) is composed of 10 subframes of 1 ms each, carrying user data. Our approach involves mapping these subframes on cloud hardware such as microservers. In this paper, PHY signal processing refers to processing a PHY subframe. The latency specification in the LTE-A standard is as follows: in the control plane, the transition time from idle to connected should be lower than 50 ms; in the user plane, the latency from the time a packet is received at the IP layer in the UE (or eNB) to the time it is available at the IP layer of the eNB (or UE) needs to be less than 5 ms in the unloaded condition (a single user with a single data stream). The number of users and data streams is expected to increase the experienced latency, with the specified maximum user-plane latency being 10 ms. This is the signal processing time in the uplink and downlink directions. Our framework works with any specified latency constraint, making it applicable to current and future telecommunication standards.

B. Time- and Space-varying Data Rate

Figure 3 shows the typical mobile traffic experienced by an operator. Variations are shown over a 24-hour period for office

Fig. 3: Workload variation in time and space [28] (normalized workload vs. time of day for an office and a residential base station).

Fig. 4: Proposed run-time framework (LTE-A PHY mapped via hypervisor, Docker or bare OS on the operating system (Ubuntu) and hardware (CPU, GPU, DSP, memory hierarchy); the framework reads CPU statistics through the OS interface, represents and predicts workload, and selects the mapping and CPU frequency).

and residential base stations, as indicated in the figure. This traffic pattern changes due to (1) weekly variations, such as different patterns on weekdays compared to weekends; (2) seasonal variations, such as those encountered on holidays; (3) special events, such as sports events and music concerts; and (4) unforeseen and sudden events, such as natural disasters. The data rate required to support this mobile traffic follows a similar distribution and variations.

C. PHY Mapping Related Optimization Metric

We use energy efficiency [29] as the optimization metric, defined as the number of subframe bits processed per unit of energy consumption (Joule), i.e.,

energy efficiency = subframe bits / energy    (2)

We aim to maximize the energy efficiency while satisfying the round-trip latency constraint.
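As a minimal numerical illustration of Eq. (2) (all values are made up):

```python
def energy_efficiency(subframe_bits, energy_joules):
    """Energy efficiency, Eq. (2): subframe bits processed per Joule."""
    return subframe_bits / energy_joules

# e.g., 1.2 Mbit of subframe data processed at a cost of 0.5 J
eff = energy_efficiency(1_200_000, 0.5)  # → 2400000.0 bits/Joule
```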

V. RUN-TIME MAPPING FRAMEWORK

Figure 4 shows our proposed run-time framework. A microserver system is represented using three layers of abstraction. The top layer is the application layer, which consists of the applications (in this case LTE-A PHY signal processing tasks) to be mapped on the hardware. The middle layer is the system software layer, also called the operating system layer. A typical operating system, such as Ubuntu, coordinates application execution on the hardware. The bottom layer is the hardware layer, composed of one or more CPU cores, specialized hardware such as graphics processing units (GPUs) and digital signal processors (DSPs), and the memory hierarchy. The PHY processing can be mapped to the hardware directly (bare OS mapping)


or using cloud virtualization technologies such as hypervisors and Docker. In this work we demonstrate mapping of PHY signal processing directly on the hardware. Hypervisor- and Docker-based approaches will be addressed in future work.

Our run-time framework is detailed on the right of Figure 4. At every N subframes, the following actions are performed:

• Step 1: Collect the workload for the last N subframes;
• Step 2: Predict the workload for the next N subframes;
• Step 3: Determine the number of CPUs and their frequencies for the next N subframes.

Our framework collects performance statistics by reading CPU hardware counters [30], [31]. These statistics are combined to represent the microserver's workload in the last interval. Using the history of prior workloads, our framework predicts the workload for the next N subframes. The predicted workload is used to determine the resource mapping and hardware frequency, considering the latency and dynamic data rate requirements. The mapping is enforced using the OS running on the hardware. Our framework uses three models to allocate resources.

• Workload Model: transforms hardware statistics into a workload representation;

workload = f(hardware statistics)    (3)

• Workload Prediction Model: predicts the future workload;

next workload = g(current workload, history)    (4)

• Workload Latency Model: transforms workload into processing latency;

latency = h(workload, n, f)    (5)

where n is the number of cores and f is the frequency of the cores.

To elaborate our run-time framework and demonstrate the use of these three models, Figure 5 presents the detailed architecture of our approach. Representative workloads are used to generate the three models. The workload model produces the current workload, i.e., the workload obtained by processing the last N subframes. This workload is fed to the workload prediction model to predict the next workload. The predicted workload is then fed to the workload latency model to estimate the processing latency. The flow of models in the architecture is shown with dotted lines. The self edge on node C indicates that an optimization is performed to determine the number of cores and the frequency.
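The three-step loop and the three models of Eqs. (3)–(5) can be sketched as follows; all callables are placeholders standing in for the components described in this section, and the use of n·f as an energy proxy for ranking feasible configurations is our simplification, not the paper's policy:

```python
def run_time_loop(read_counters, workload_model, predict_model, latency_model,
                  apply_allocation, configs, history, latency_budget_ms=10.0):
    """One iteration of the run-time framework (Steps 1-3), executed every N
    subframes. All callables are placeholders for the paper's components."""
    # Step 1: collect hardware statistics and represent the current workload
    stats = read_counters()
    history.append(workload_model(stats))            # Eq. (3)
    # Step 2: predict the workload for the next N subframes
    w_next = predict_model(history)                  # Eq. (4)
    # Step 3: pick the cheapest (cores, frequency) meeting the latency budget
    feasible = [(n, f) for (n, f) in configs
                if latency_model(w_next, n, f) <= latency_budget_ms]  # Eq. (5)
    if not feasible:
        feasible = [max(configs, key=lambda c: c[0] * c[1])]  # best effort
    n, f = min(feasible, key=lambda c: c[0] * c[1])  # n*f as a crude energy proxy
    apply_allocation(n, f)
    return n, f
```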

A. Workload Model

Modern high-performance processors are equipped with registers and counters that record certain hardware events [30], [31], such as the number of hardware clock cycles spent executing an application (cpucycles). These performance statistics, collected over a period of time, describe an application's performance on the processor. We therefore define an application's workload in terms of these statistics. However, some of these statistics correlate strongly with others. Our aim is to find the minimal set of statistics that completely describes the workload of an application. To show the correlation among the performance statistics, we conducted an experiment in which 13 commonly used performance statistics are

Fig. 5: Detailed architecture of the proposed approach (representative workloads train the workload, workload prediction and workload latency models; live workloads read from the PMU feed the models, whose output drives the resource allocation applied through the OS to the hardware).

Fig. 6: Correlation between 13 CPU statistics (branchinst, branchmiss, buscycles, cachemiss, cacheref, cpucycles, hwinst, hwrefcycles, l1daccess, l1dmiss, l1iaccess, l1imiss, llmiss), shown as a matrix of pairwise correlation coefficients.

collected from the hardware while processing N = 100 physical layer subframes. Figure 6 reports the cross-correlation of these 13 statistics as a matrix. A correlation value of 1 (or close to 1) implies high correlation between the pair of performance statistics. This can be seen, for example, for branchinst, which is strongly correlated with cpucycles, hwinst and l1daccess. Similarly, branchmiss is strongly correlated with cpucycles, l1daccess and l1dmiss. To identify the minimal subset of these statistics to represent workload, we use principal component analysis (PCA) [19]. PCA is a statistical procedure that transforms a set of correlated variables into a set of uncorrelated orthogonal variables, called principal components. An application's workload can then be represented in terms of these principal components.

Let X_i, i ∈ {1, ..., N}, be N observations, each recorded after decoding a PHY subframe. An observation X_i consists of M CPU statistics {x_i^1, x_i^2, ..., x_i^M}. Thus, the CPU statistics are represented as the matrix X ∈ R^{N×M}. The PCA of X is given by

P = X × R    (6)

where P ∈ R^{N×L} are the principal components for the N


Fig. 7: Variance of the principal components (variance vs. principal components 1–10).

Fig. 8: Feedforward and recurrent spiking neural networks ((a) a feedforward neuron post_j with inputs pre_{i1}, ..., pre_{in} and weights r_{i1,j}, ..., r_{in,j}; (b) a recurrent network with cyclic connections).

observations and R ∈ R^{M×L} is the rotation matrix with L principal components. Once the principal components are identified, we plot their variance, as shown in Figure 7. The variance of a principal component signifies its importance. As seen from this figure, the variances of all but the first component are insignificant and can thus be omitted without loss in accuracy. So L = 1 in our experimental setup, implying that the workload due to PHY subframe processing is a single value, which is a unique combination of the CPU statistics.

Model generation using supervised learning: We use a supervised learning technique to generate the workload model. In particular, we use offline PHY subframe data, synthetically generated by varying RAN parameters (refer to Section VI for an overview of the RAN parameters). These offline data are used to determine the rotation matrix R ∈ R^{M×1}. At run-time, when real subframes are processed, the rotation matrix is applied to generate the workload.
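The offline derivation of R and its run-time application can be sketched as below; the random statistics matrix is a stand-in for measured counter data, and the plain-NumPy PCA (eigendecomposition of the covariance of the standardized statistics) is our illustrative choice of implementation:

```python
import numpy as np

def fit_rotation(X, L=1):
    """Offline: derive the rotation matrix R of Eq. (6) from an N x M matrix
    of CPU statistics, keeping the L leading principal components."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each statistic
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
    order = np.argsort(eigvals)[::-1]             # largest variance first
    return eigvecs[:, order[:L]]                  # R in R^{M x L}

def workload(x_row, R, mean, std):
    """Run-time: project one observation onto the first principal component."""
    return float(((x_row - mean) / std) @ R[:, 0])

# Illustrative: 100 observations of 13 CPU statistics
rng = np.random.default_rng(0)
X = rng.random((100, 13))
R = fit_rotation(X, L=1)
w = workload(X[0], R, X.mean(axis=0), X.std(axis=0))
```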

B. Introduction to Neural Networks

Figure 8 shows two example neural networks. From a computing perspective, an output neuron computes the weighted sum of the responses of its input neurons:

y_j = Σ_i w_i · x_i    (7)

The type of response defines the class of neural network: spiking neural networks communicate using short pulses (spikes), whereas classical neural networks communicate using continuous (analog) values. Topologically, neural networks can be classified into the feedforward architecture, where information flow is unidirectional, and the recurrent architecture, where information can flow forward and backward, with potential

Fig. 9: A typical echo state network (random input weights, a random reservoir with the echo state property, and a trained readout for prediction/classification).

existence of cyclic information flow. Of the different learning algorithms developed for neural networks, we focus on supervised learning, which is the prevalent approach for image processing and pattern recognition problems. In supervised learning, a neural network is trained with representative data to distinguish images and recognize patterns, where the training process involves adjusting the synaptic weights between neurons. Once trained, the neural network performs the desired task on new inputs.
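As a minimal illustration of the weighted sum in Eq. (7) (the weights and inputs below are arbitrary):

```python
import numpy as np

def neuron_output(weights, inputs):
    """A classical (rate-based) output neuron, Eq. (7): weighted sum of the
    responses of its input neurons (no activation function, as in the text)."""
    return float(np.dot(weights, inputs))

y = neuron_output([0.5, -0.2, 0.1], [1.0, 2.0, 3.0])
```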

C. Workload Prediction Model

In Section V-A we have shown how PCA can be used to represent the processing of LTE-A PHY subframes as a single-valued workload. The next model is the workload prediction model, which predicts the workload for the next set of N subframes (refer to Figure 5). This enables a predictive strategy for resource allocation. Single-value workloads collected at every instance of time form time-series data. Several approaches have been proposed for online prediction of time-series data [32]. Fundamentally, these approaches store a certain amount of historical data to predict future workloads, their effectiveness depending on the amount of history stored. In this context, storing the history has a significant impact on the system cost. We therefore seek a time-series predictor with the least overhead for storing past (history) workloads that can still predict workload with reasonable accuracy. We use reservoir computing, an important class of neural networks providing a dynamic model for processing time-series data. In the reservoir computing approach, an input signal is injected into a network of artificial neurons, which transforms the input and maps it into a higher-dimensional space, facilitating prediction (forecasting) or classification using a simple adjustable memory-less readout unit. In this work we focus on a class of reservoir computing called the echo state network [33], which is essentially a non-trainable recurrent neural network reservoir followed by a trainable linear readout, as shown in Figure 9. Inputs to the echo state network are the time-series data, i.e., statistics recorded from the hardware and converted to workloads. Our choice of the echo state network is guided by the fact that it can implement long memory without storing all the previous states [33]. This is efficient in terms of resource usage (compute and storage) and energy



Fig. 10: Neural network for workload prediction (inputs: wi, F, V, Ns; output: L).

consumption. Let wi be the workload at the ith instance of time. The predicted workload at the (i + 1)th instance is given by

ŵi+1 = g(wi, wi−1, wi−2, · · · , wi−n+1) (8)

where n is the number of past workloads needed to predict the workload efficiently. We use supervised training with offline data to build the model. The principle involves training the readout unit of the echo state network. Essentially, n past workloads are fed to the reservoir and the predicted output ŵi

is generated from the readout. Next, the readout weights are adjusted to reduce the error difference ∆w, given by

∆w = |ŵi − wi| (9)
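The echo state predictor described above can be sketched as follows. This is a minimal illustrative sketch, not the implementation used in the paper: the reservoir size, spectral radius, ridge regularizer and the synthetic periodic workload are all assumptions, and the readout is trained in one shot by ridge regression in place of the iterative weight adjustment of Equation 9.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyper-parameters for a small reservoir driven by the
# scalar workload series; none of these values come from the paper.
N_RES = 50               # reservoir neurons
SPECTRAL_RADIUS = 0.9    # ensures the echo-state (fading memory) property
RIDGE = 1e-6             # regularizer for the linear readout

W_in = rng.uniform(-0.5, 0.5, N_RES)          # fixed input weights
W = rng.uniform(-0.5, 0.5, (N_RES, N_RES))    # fixed recurrent weights
W *= SPECTRAL_RADIUS / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(workloads):
    """Drive the reservoir with the workload series; return one state per step."""
    x = np.zeros(N_RES)
    states = []
    for w in workloads:
        x = np.tanh(W_in * w + W @ x)
        states.append(x.copy())
    return np.array(states)

def train_readout(workloads):
    """Train only the memory-less readout: reservoir state at step i -> w_{i+1}."""
    S = run_reservoir(workloads[:-1])
    targets = workloads[1:]
    return np.linalg.solve(S.T @ S + RIDGE * np.eye(N_RES), S.T @ targets)

# Synthetic periodic workload standing in for the measured series.
t = np.arange(500)
series = 1.0 + 0.5 * np.sin(2 * np.pi * t / 50)
w_out = train_readout(series)
pred = run_reservoir(series[:-1]) @ w_out        # one-step predictions ŵ
err = float(np.mean(np.abs(pred - series[1:])))  # mean |ŵ - w|, cf. Eq. (9)
```

Only the readout weights are adjusted; the reservoir itself stays fixed, which is what makes the predictor cheap to retrain at run time.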

D. Workload Latency Model

The last model in our framework is the workload latency

model. This model depends on the following:
• the predicted workload ŵi;
• the frequency of the processing core F;
• the voltage of the core V; and
• the number of subframes mapped per processing core Ns.

This model can be represented as

L = h(ŵi, F, V, Ns) (10)

We use neural networks to model the processing latency as shown in Figure 10. This choice is guided by the fact that multiple hidden layers in a neural network allow hierarchical combination of features, implementing a nonlinear representation of the latency. This is contrary to techniques such as support vector machines, which are essentially single-layer neural networks without hierarchical feature combinations. Additionally, neural networks can be efficiently implemented in both hardware and software. The four parameters (predicted workload, frequency, voltage and subframes per core) form the input neurons of the structure. The neural network has one output neuron, which represents the latency. There are three hidden layers in the neural network structure (Figure 10), classifying this as a deep neural network. We apply supervised learning to train the neural network with offline data as shown in Figure 5. Finally, in line with previous research on neural network model selection, we have used sequential backward selection [34], a greedy algorithm, to select the features of the deep neural network, including the number of hidden layers and neurons per hidden layer. This search method starts with all the features and removes a single feature at each step with a view to improving the cost function.
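The greedy sequential backward selection step can be illustrated with a small sketch. The feature names f1–f5 and the toy validation-error cost are hypothetical stand-ins; in the paper the cost function is the validation error of the trained latency network.

```python
def sequential_backward_selection(features, cost):
    """Greedy SBS: start from all features and, at each step, drop the
    single feature whose removal most improves the cost function."""
    selected = list(features)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        best = cost(selected)
        worst = None
        for f in selected:
            trial = [g for g in selected if g != f]
            c = cost(trial)
            if c < best:                 # removing f improves the cost
                best, worst, improved = c, f, True
        if improved:
            selected.remove(worst)       # make the best removal permanent
    return selected

# Hypothetical validation-error cost: f1 and f2 carry signal, the rest are
# noise features that each add a small penalty when kept.
INFORMATIVE = {"f1", "f2"}
def val_error(feats):
    missing = len(INFORMATIVE - set(feats))   # dropped informative features
    noise = len(set(feats) - INFORMATIVE)     # retained noise features
    return 1.0 * missing + 0.1 * noise

selected = sequential_backward_selection(["f1", "f2", "f3", "f4", "f5"],
                                         val_error)
```

Under this toy cost, the search strips the three noise features one per step and stops once removing either informative feature would increase the cost.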

E. Optimizing Resource Selection

In Sections V-A–V-D, we formulated the three models which form the basis for resource selection. To define the optimization problem that we aim to solve in this work, the following variables are introduced. Let xi,j be the binary variable representing the mapping of the ith subframe on the jth

core, where i ∈ {0, 1, · · · , N − 1}, N being the number of subframes to be mapped, and j ∈ {0, 1, · · · , C − 1}, C being the number of cores in the system, i.e.,

xi,j = 1 if the ith subframe is mapped to the jth core, and 0 otherwise. (11)

Let (Vk, Fk) for k ∈ {0, 1, · · · , Nf − 1} be the Nf voltage-frequency pairs supported on the system. We introduce the binary variable yi,k to denote the voltage-frequency assignment of subframe i:

yi,k = 1 if the ith subframe is assigned to the kth voltage-frequency pair, and 0 otherwise. (12)

The following constraints and equations are in place.
• A subframe is mapped to one core only:

∑j xi,j = 1 ∀i (13)

• A subframe is assigned to one voltage-frequency pair:

∑k yi,k = 1 ∀i (14)

• The number of subframes mapped to core j is

Njs = ∑i xi,j (15)

• The latency for processing the ith subframe is

Li = ∑j ∑k xi,j · yi,k · h(wi, Fk, Vk, Njs) (16)

• The latency for each subframe should be less than the latency specified in the LTE-A specification, i.e.,

Li ≤ 10 ms ∀i (17)

Next we formalize the optimization objective of Equation 2 in terms of the variables xi,j and yi,k. For this, we use the energy model of [35], which gives the energy values corresponding to a workload, voltage and frequency of operation. Let this model be denoted by M. The energy consumption of subframe i is

Ei = ∑j ∑k xi,j · yi,k · M(wi, Fk, Vk) (18)

The performance (energy efficiency) is given by

P = ∑i Bi/Ei (19)

where Bi is the number of bits needed to encode subframe i. The optimization problem is therefore

Maximize P (20)

Constraints: Equations 13, 14 & 17

The optimization problem of Equation 20 is non-linear (quadratic). We use the hill climbing technique [36] to solve it. This is a class of iterative techniques used to find a local maximum of an optimization objective. It




Fig. 11: Setup showing the downlink simulator [18] (quad-core CPU; L1: 256 KB, L2: 1 MB, L3: 6 MB; RAM: 8 GB).

starts with an arbitrary solution to the problem, then attempts to find a better solution by incrementally changing a single variable at a time per iteration. If the change produces a better solution, the change is made permanent and the whole process is repeated. The process terminates when no further improvements can be found.

Solutions to the optimization problem in Equation 20 are the variables xi,j and yi,k, which are used to determine the mapping and the voltage-frequency assignments. The total number of cores utilized in the mapping is determined as follows. The number of subframes mapped to core j is given by

Nj = ∑i xi,j (21)

Finally, the number of cores used in the mapping is the number of cores to which one or more subframes are mapped, calculated as

Nc = ∑j min(1, Nj) (22)

VI. RESULTS

Figure 11 shows our experimental setup. The PHY simulator of the OpenAirInterface [22] is used as the application for the uplink and downlink directions. Figure 11 shows dataflow in the downlink direction only. The following RAN parameters are used for workload characterization and latency modeling.
• PRB: In Section IV, we show that a 10 ms LTE-A frame

is composed of ten 1 ms subframes, where subframes are the smallest abstraction for our resource allocation strategy. Each 1 ms subframe consists of two 0.5 ms slots to accommodate user data represented as physical resource blocks (PRBs). Typically, 6 to 110 PRBs are supported per 0.5 ms slot depending on the bandwidth allocation and resource availability. Changing the PRB changes the processing demand of the subframe; we therefore vary this parameter to generate offline synthetic data for building our model. In particular, three values are used for the PRB: 25 (low bandwidth), 50 (moderate bandwidth) and 100 (high bandwidth).

• MCS Index: The modulation and coding scheme (MCS) index value summarizes the modulation scheme and coding rate used for the PRBs in a subframe. As specified in the LTE-A standard [18], 32 values are supported for the MCS index, covering 3 modulation schemes – QPSK, 16QAM and 64QAM. The code rate can be determined by mapping the MCS index to the transport

Fig. 12: CPU cores needed to sustain variable data rate (number of cores vs. time of day, for the proposed approach and Bhaumik et al. [13]).

block size as specified in the standard. A simple example is provided below to determine the throughput and code rate for a given MCS index. Let MCS index = 20. According to Table 7.1.7.1-1 of the LTE-A standard [18], the corresponding transport block size (TBS) index is 18. Furthermore, we assume that the number of PRBs = 10. From Table 7.1.7.2-1 of the LTE-A standard, a TBS index of 18 and PRB = 10 corresponds to a TBS of 4008 bits. Assuming no MIMO is used, the throughput and code rate are given by

throughput = TBS × 1000 = 4.008 Mbps (23)

code rate = (TBS + CRC)/(RE × MO)

where

TBS = 4008
CRC = 24
RE = PRB × sub-carriers × OFDM symbols × slots per subframe × efficiency
   = 10 × 12 × 7 × 2 × 0.9 = 1512
MO = modulation order from Table 7.1.7.1-1 = 6

The code rate is therefore

Code Rate = (4008 + 24)/(1512 × 6) = 0.44 (24)

It is to be noted that in deriving the code rate, we have assumed standard channel parameters of 24 CRC bits, 12 sub-carriers and 7 OFDM symbols per slot. From the calculation, it can be inferred that the higher the MCS index, the higher the spectral efficiency, i.e., the code rate. Thus, varying the MCS index also changes the processing demand for a subframe and is therefore used to generate offline data for training our models.
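The worked example above can be reproduced programmatically. The helper names are hypothetical, and the default channel parameters (24 CRC bits, 12 sub-carriers, 7 OFDM symbols per slot, 2 slots, 0.9 efficiency) mirror the assumptions stated in the text.

```python
def lte_throughput_mbps(tbs_bits):
    """One transport block per 1 ms subframe: bits * 1000 subframes/s, in Mbps."""
    return tbs_bits * 1000 / 1e6

def lte_code_rate(tbs_bits, prb, mod_order, crc_bits=24,
                  subcarriers=12, ofdm_symbols=7, slots=2, efficiency=0.9):
    """Code rate = (TBS + CRC) / (RE * modulation order), as in Eq. (24)."""
    re = prb * subcarriers * ofdm_symbols * slots * efficiency
    return (tbs_bits + crc_bits) / (re * mod_order)

# Worked example from the text: MCS index 20 -> TBS index 18; with PRB = 10
# this gives TBS = 4008 bits and modulation order 6 (64QAM).
throughput = lte_throughput_mbps(4008)   # 4.008 Mbps
code_rate = lte_code_rate(4008, 10, 6)   # ~0.44
```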

The general purpose hardware is a microserver with a quad-core Intel Haswell processor [23] (configuration in Figure 11). Hardware performance statistics are probed using PAPI [37] and energy consumption is estimated using likwid [38]. All the models are implemented in R [39].

A. Resource Adaptation to Dynamic Data Rate

Figure 12 reports the resource (CPU cores) adaptation using the proposed approach in response to dynamic data rate (Figure 3). Also reported in the figure is the resource adaptation using the existing technique of Bhaumik et al. [13]. It is to be noted that the resource requirement using the



Fig. 13: Energy efficiency at peak traffic condition (energy efficiency in Mbits/J vs. MCS index, for Chen et al. [12], Bhaumik et al. [13] and the proposed approach).

existing technique of Chen et al. [12] is similar to [13] and is therefore not included in the figure. As discussed in Section II, existing techniques such as [13] are conservative in nature for most of the day. This means resources are reserved beforehand, based on the mobile traffic observed in past hours. The adaptation is independent of the workload history over daily, weekly and seasonal variations. This is also confirmed by the figure, which shows stable resource allocation for [13] for most of the day. Contrary to this, resource allocation using our technique varies throughout the day in response to the time-varying data rate and the history of workload. At times of low workload (0:00 – 5:00 AM), the proposed approach reduces resource requirements by 12 – 34 CPUs, an average 36% reduction. At other times where the data rate demand is high (such as at 8:00 AM), the proposed technique needs more CPUs than [13]. However, workload predicted using [13] is not accurate (see Section III for the accuracy comparison). Therefore, the data rate achieved using this existing technique is occasionally lower than the required data rate, implying variable quality-of-service provided to users or, in the worst case, dropping of users from service [17]. Our approach results in better resource utilization, guaranteeing a certain quality-of-service to users.

B. Energy Efficiency at Peak Mobile Traffic

Figure 13 reports the energy efficiency obtained using the proposed approach compared to two existing techniques – Chen et al. [12] and Bhaumik et al. [13]. Energy efficiencies (measured as Mbits/J) at peak traffic condition are reported for different MCS indexes with PRB = 100. Results for other PRBs are similar. As can be clearly seen from the figure, energy efficiencies are comparable for all three approaches at lower MCS indexes (MCS index < 5). However, the energy efficiency of our approach becomes significantly higher than that of the existing techniques at higher MCS indexes (MCS index > 5). A second trend from this figure is that the improvements of our approach over the existing approaches grow with the MCS index. This is evident from the energy efficiency of our approach at, say, MCS index 26, which is 48% higher than the energy efficiency of [13] at the same MCS index. As described in Section VI, a high MCS index implies high spectral efficiency; our approach therefore provides higher benefits in higher spectral efficiency regions, which correspond to regions of dense users. This is an important consideration for C-RAN,

Fig. 14: Accuracy of PCA-based workload modeling.

which yields the most benefit (operating-cost wise) when deployed in densely populated zones.

Overall, the improvements using our approach can be attributed to better workload prediction and the deep learning based run-time framework. To summarize, the energy efficiency using the proposed approach is higher than Chen et al. [12] by 1.4 Mbps/W – 26 Mbps/W, an average 28% (maximum 48%) improvement. Compared to Bhaumik et al. [13], the energy efficiency of the proposed approach is higher by 0.5 Mbps/W – 19.1 Mbps/W, an average 16% (maximum 43%) improvement.

C. Accuracy of the Workload Prediction Model

To assess the accuracy of the proposed PCA-based approach to represent hardware workload due to PHY subframe processing, we conducted an experiment where performance statistics collected from the hardware are used individually to correlate with the PHY subframe processing latency. This is shown in Figure 14. For each performance statistic, there are two bars, corresponding to the minimum and maximum correlation values at different hardware frequencies. The average correlation for each performance statistic is also shown in the figure as a line plot. Finally, the figure reports the correlation using the proposed PCA-based approach (last two bars in the figure). As seen from the figure, the PCA-based approach demonstrates 99.88% correlation with the PHY subframe processing latency. This is similar to using some other performance statistics, such as cpucycles. However, an advantage of the PCA-based approach is that, by solely utilizing the rotation vector and the single-valued workloads, all statistics can be re-created, which can be exploited to further optimize the mapping. Additionally, the space required to store all M performance statistics for N observations is (N · M · B), where B is the number of bits used to store each statistic. For the PCA-based approach, this is (N + M) · B, where the first term in parentheses is the space required to store the N single-valued workloads and the second term is the space required to store the rotation vector with one principal component. Clearly, the storage required for the proposed PCA-based approach is lower than that required for naively storing all statistics.
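The PCA-based single-valued workload and the re-creation of statistics from the rotation vector can be sketched as follows. The synthetic counter matrix, its dimensions and the noise level are assumptions made for illustration; the real input is the matrix of PAPI performance statistics.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for N observations of M performance counters
# (cycles, cache misses, ...); a single latent load drives all counters.
N, M = 200, 6
load = rng.uniform(0.0, 1.0, N)
counters = (np.outer(load, rng.uniform(1.0, 5.0, M))
            + rng.normal(0.0, 0.01, (N, M)))

# PCA via SVD of the centered data: the first right-singular vector is the
# rotation vector; projecting onto it gives the single-valued workload.
mean = counters.mean(axis=0)
U, S, Vt = np.linalg.svd(counters - mean, full_matrices=False)
rotation = Vt[0]                             # M numbers, stored once
workload = (counters - mean) @ rotation      # one number per observation

# All statistics can be (approximately) re-created from the workload and
# the rotation vector alone, as noted in the text.
recreated = mean + np.outer(workload, rotation)
max_err = float(np.max(np.abs(recreated - counters)))

# Storage comparison from the text: N*M*B bits naively vs (N + M)*B here.
corr = float(np.corrcoef(workload, load)[0, 1])
```

On this synthetic data the single-valued workload is almost perfectly correlated with the underlying load, mirroring the high correlation reported for the measured statistics.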

Figure 15 reports the accuracy of workload prediction using our proposed echo state network-based predictor. Results are



Fig. 15: Workload prediction using neural network with history of past workloads. Shaded area represents prediction error. (a) History = 2 mins. (b) History = 4 mins.

reported for two, three and four minutes of workload history. In each figure, the actual workload is shown as the solid line. In Figure 15a, the first two minutes of the plot show the workload used to train the predictor. This is followed by a shaded region, which represents the uncertainty in workload prediction around the actual workload of the next few minutes. The uncertainty region is drawn by generating the predicted output from the trained echo state network for 1 million iterations. The prediction accuracy is inversely proportional to the width of the uncertainty region. It can also be seen from the figure that there are actually two shaded regions around the actual workload. The light shaded region represents the results obtained when the echo state network is trained for a limited number of iterations (i.e., low-effort training), while the dark shaded region is obtained by training the echo state network for extended iterations, allowing the network to converge before using it. These results show that the higher the training effort of the echo state network based predictor, the better the accuracy of prediction. Results for other workload histories can be similarly inferred.

Figure 15b shows the prediction uncertainty when using 4 minutes of workload history to train the predictor. The accuracy of prediction improves significantly compared to using 2 and 3 minutes of workload history, as seen from the reduced width of the uncertainty region. This result implies that selection of the amount of workload history is critical to improving the accuracy of the predictor.

To further analyze this, Figure 16 shows workload prediction errors for a reference workload by varying the amount of history used to predict future workloads. We used k-fold cross validation [40], splitting the collected data into training and test sets. The echo state network is trained using different training sets. Variations are indicated using error bars for ten runs. As seen from this figure, with a small amount of history (prior workloads), the prediction error is high. This is because with less data to train, the model becomes biased to the limited training set and produces high error on the unseen workload. This phenomenon in statistics is called model underfitting. On the other hand, using too much data to train the model also biases the model, preventing it from generalizing to unseen workloads. This phenomenon is called model overfitting [41]. In this example, a minimum error is

Fig. 16: Workload prediction accuracy.

Fig. 17: Modeling accuracy of PHY signal processing latency.

observed with 35 minutes of workload history. This error is the accuracy after the echo state model converges, as discussed before. It is to be noted that the minimum-error point (35 minutes in the example above) depends on the workload used to train the echo state network. We performed a thorough design space exploration to identify the minimum-error point for any given workload (Figure 3). Additionally, our run-time framework periodically re-evaluates the predictor to incorporate changes in workload variation and generate accurate workload predictions, leading to efficient resource allocation.
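The history-length selection can be sketched as a k-fold validation-error sweep, in the spirit of [40]. As simplifications, a linear autoregressive readout stands in for the echo state network, and a synthetic periodic workload stands in for the traffic trace of Figure 3; the candidate history lengths are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic workload: periodic signal plus noise, standing in for Figure 3.
t = np.arange(600)
series = np.sin(2 * np.pi * t / 40) + 0.1 * rng.standard_normal(600)

def ar_validation_error(series, n_hist, n_folds=5):
    """Fit a linear predictor on n_hist past samples (a stand-in for the
    echo state readout) and return the mean k-fold validation MSE."""
    X = np.array([series[i - n_hist:i] for i in range(n_hist, len(series))])
    y = series[n_hist:]
    errs = []
    for val in np.array_split(np.arange(len(y)), n_folds):
        train = np.setdiff1d(np.arange(len(y)), val)
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((X[val] @ coef - y[val]) ** 2))
    return float(np.mean(errs))

# Sweep the amount of history and pick the minimum-error point, as in
# the design space exploration described above.
candidates = [2, 5, 10, 20, 40, 80]
errors = {n: ar_validation_error(series, n) for n in candidates}
best_n = min(errors, key=errors.get)
```

Too little history underfits (high validation error), while excessive history risks overfitting; the sweep picks the minimum between the two regimes.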

D. Accuracy of the Workload Latency Model

In this section we evaluate the performance of the proposed deep learning-based modeling of PHY latency for a single subframe. Figure 17 shows the actual and the predicted latency for three modulation schemes – QPSK, 16QAM and 64QAM. Results are shown for all frequencies supported on the hardware. Also reported in each figure is the root mean square of the modeling error. There are a few trends to follow from this figure. First, the latency (both predicted and actual) for 64QAM is higher than for 16QAM, which in turn is higher than for QPSK. This is due to more processing being required for 64QAM than for 16QAM, which requires more processing than QPSK. Second, the latency reduces with an increase in CPU frequency for all three schemes. This is because, with each subframe represented as a fixed number of CPU cycles, the time period of each cycle decreases with an increase in frequency, reducing the overall execution time. Finally, the modeling error is less than 4% (3.86%) for all three schemes,



Fig. 18: Accuracy of latency modeling (modeling error in % vs. CPU frequency in GHz).

justifying the accuracy of the proposed deep learning-based latency model. To clarify this further, Figure 18 plots the average modeling error of the proposed approach for different CPU frequencies, averaged over all RAN parameters. The plot is centered around the average error of 3.86%. As can be seen, the modeling error is high (≈ 7%) for the turbo-mode frequency of 3.401 GHz. This is because this operating mode is not used in generating the latency model. However, the model is still reasonably accurate at this frequency. This result implies that the workload latency model can be transferred to other hardware, with prediction results still within an acceptable accuracy bound (≤ 10%). As an example, the same model can be used if the microserver hardware is changed from the Haswell to the Ivy Bridge or Broadwell processor type. A second conclusion can be inferred: a cloud system designer can use this model to estimate the PHY processing latency for a choice of core before actual deployment. This will guide decisions such as upgrading/downgrading cloud hardware or selecting a particular backup server in case of server downtime due to scheduled maintenance or server faults.

VII. CONCLUSION

We proposed an adaptive run-time framework to map PHY signal processing tasks on general purpose hardware. Using machine learning, we have demonstrated that the framework can adapt to dynamic data rates and still meet the strict latency specification. Additionally, by using neural networks to accurately predict workload, resource consumption can be reduced by up to 36%, increasing the average energy efficiency by 28% (maximum 48%). Our continuing work will address cloud virtualization and the mapping of higher-layer signal processing tasks on microservers.

ACKNOWLEDGMENT

This work is supported by the EU-H2020 grant NeuRAM3 (NEUral computing aRchitectures in Advanced Monolithic 3D-VLSI nano-technologies).

REFERENCES

[1] X. Zhang, Z. Yi, Z. Yan, G. Min, W. Wang, A. Elmokashfi, S. Maharjan, and Y. Zhang, "Social computing for mobile big data," Computer, vol. 49, no. 9, pp. 86–90, 2016.

[2] P. Rost, A. Banchs, I. Berberana, M. Breitbach, M. Doll, H. Droste, C. Mannweiler, M. A. Puente, K. Samdanis, and B. Sayadi, "Mobile network architecture evolution toward 5G," IEEE Communications Magazine, vol. 54, no. 5, pp. 84–91, 2016.

[3] A. Checko, H. L. Christiansen, Y. Yan, L. Scolari, G. Kardaras, M. S. Berger, and L. Dittmann, "Cloud RAN for mobile networks – A technology overview," IEEE Communications Surveys & Tutorials, vol. 17, no. 1, pp. 405–426, 2015.

[4] M. Peng, Y. Sun, X. Li, Z. Mao, and C. Wang, "Recent advances in cloud radio access networks: System architectures, key techniques, and open issues," IEEE Communications Surveys & Tutorials, vol. 18, no. 3, pp. 2282–2308, 2016.

[5] M. Palkovic, P. Raghavan, M. Li, A. Dejonghe, L. Van der Perre, and F. Catthoor, "Future software-defined radio platforms and mapping flows," IEEE Signal Processing Magazine, vol. 27, no. 2, pp. 22–33, 2010.

[6] T. Kaitz and G. Guri, "CPU-MPU partitioning for C-RAN applications," in Conference on Communications and Networking in China (CHINACOM). IEEE, 2012, pp. 767–771.

[7] "Suggestions on potential solutions to C-RAN," NGMN Alliance White Paper, 2013.

[8] Z. Cai and D. Liu, "Baseband design for 5G UDN base stations: Methods and implementation," China Communications, vol. 14, no. 5, pp. 59–77, 2017.

[9] B. Habib and B. Baz, "Digital architecture of 8×8 MIMO hardware channel simulator for time-varying heterogeneous systems with LTE-A, 802.11ac and VLC signals," in International Conference on Advances in Computational Tools for Engineering Applications (ACTEA). IEEE, 2016, pp. 195–200.

[10] N. Kai, S. Jianxing, C. Kuilin, and K. K. Chai, "TD-LTE eNodeB prototype using general purpose processor," in Conference on Communications and Networking in China (CHINACOM). IEEE, 2012, pp. 822–827.

[11] L. Guangjie, Z. Senjie, Y. Xuebin, L. Fanglan, N. Tin-fook, S. Zhang, and K. Chen, "Architecture of GPP based, scalable, large-scale C-RAN BBU pool," in IEEE Globecom Workshops, 2012, pp. 267–272.

[12] Z. Chen and J. Wu, "LTE physical layer implementation based on GPP multi-core parallel processing and USRP platform," in Conference on Communications and Networking in China (CHINACOM). IEEE, 2014, pp. 197–201.

[13] S. Bhaumik, S. P. Chandrabose, M. K. Jataprolu, G. Kumar, A. Muralidhar, P. Polakos, V. Srinivasan, and T. Woo, "CloudIQ: A framework for processing base stations in a data center," in International Conference on Mobile Computing and Networking (MobiCom). ACM, 2012, pp. 125–136.

[14] B. Guan, X. Huang, G. Wu, C. Chan, M. Udayan, and C. Neelam, "A pooling prototype for the LTE MAC layer based on a GPP platform," in IEEE Global Communications Conference (GLOBECOM). IEEE, 2015, pp. 1–7.

[15] D. Maidment and N. Werdmuller, "The route to 5G: The key features of next generation cellular communications and the key technology components required," ARM White Paper, 2016.

[16] D. Zeng, J. Zhang, S. Guo, L. Gu, and K. Wang, "Take renewable energy into CRAN toward green wireless access networks," IEEE Network, vol. 31, no. 4, pp. 62–68, 2017.

[17] U. K. Shukla et al., "Methods and apparatus for reducing call drop rate," Feb. 2, 2016, US Patent 9,253,664.

[18] A. Ghosh, R. Ratasuk, B. Mondal, N. Mangalvedhe, and T. Thomas, "LTE-Advanced: Next-generation wireless broadband technology," IEEE Wireless Communications, vol. 17, no. 3, pp. 10–22, 2010.

[19] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis," Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1-3, pp. 37–52, 1987.

[20] H. Jaeger, "Echo state network," Scholarpedia, vol. 2, no. 9, p. 2330, 2007.

[21] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[22] N. Nikaein, M. K. Marina, S. Manickam, A. Dawson, R. Knopp, and C. Bonnet, "OpenAirInterface: A flexible platform for 5G research," ACM SIGCOMM Computer Communication Review, vol. 44, no. 5, pp. 33–38, 2014.

[23] P. Hammarlund, R. Kumar, R. B. Osborne, R. Rajwar, R. Singhal, R. D'Sa, R. Chappell, S. Kaushik, S. Chennupaty, S. Jourdan et al., "Haswell: The fourth-generation Intel Core processor," IEEE Micro, no. 2, pp. 6–20, 2014.

[24] D. Pompili, A. Hajisami, and T. X. Tran, "Elastic resource utilization framework for high capacity and energy efficiency in cloud RAN," IEEE Communications Magazine, vol. 54, no. 1, pp. 26–32, 2016.

[25] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy, "A deep reinforcement learning based framework for power-efficient resource allocation



in cloud RANs," in International Conference on Communications (ICC). IEEE, 2017, pp. 1–6.

[26] N. Yu, Z. Song, H. Du, H. Huang, and X. Jia, "Dynamic resource provisioning for energy efficient cloud radio access networks," IEEE Transactions on Cloud Computing, 2017.

[27] W. N. S. F. W. Ariffin, X. Zhang, and M. R. Nakhai, "Sparse beamforming for real-time resource management and energy trading in green C-RAN," IEEE Transactions on Smart Grid, vol. 8, no. 4, pp. 2022–2031, 2017.

[28] D. Naboulsi, "Analysis and exploitation of mobile traffic datasets," Ph.D. dissertation, INSA Lyon, 2015.

[29] S. Kamil, J. Shalf, and E. Strohmaier, "Power efficiency in high performance computing," in IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008, pp. 1–8.

[30] X. Zhang, S. Dwarkadas, G. Folkmanis, and K. Shen, "Processor hardware counter statistics as a first-class system resource," in HotOS, 2007.

[31] B. Sprunt, "The basics of performance-monitoring hardware," IEEE Micro, no. 4, pp. 64–71, 2002.

[32] C. Richard, J. C. M. Bermudez, and P. Honeine, "Online prediction of time series data with kernels," IEEE Transactions on Signal Processing, vol. 57, no. 3, pp. 1058–1067, 2009.

[33] W. Maass, T. Natschläger, and H. Markram, "Real-time computing without stable states: A new framework for neural computation based on perturbations," Neural Computation, vol. 14, no. 11, pp. 2531–2560, 2002.

[34] J. Kittler, "Feature selection and extraction," Handbook of Pattern Recognition and Image Processing, pp. 59–83, 1986.

[35] M. J. Walker, S. Diestelhorst, A. Hansson, A. K. Das, S. Yang, B. M. Al-Hashimi, and G. V. Merrett, "Accurate and stable run-time power modeling for mobile and embedded CPUs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106–119, 2017.

[36] S. M. Goldfeld, R. E. Quandt, and H. F. Trotter, "Maximization by quadratic hill-climbing," Econometrica: Journal of the Econometric Society, pp. 541–551, 1966.

[37] P. J. Mucci, S. Browne, C. Deane, and G. Ho, "PAPI: A portable interface to hardware performance counters," in Department of Defense HPCMP Users Group Conference, 1999, pp. 7–10.

[38] J. Treibig, G. Hager, and G. Wellein, "Likwid: A lightweight performance-oriented tool suite for x86 multicore environments," in International Conference on Parallel Processing Workshops (ICPPW), 2010, pp. 207–216.

[39] R. Ihaka and R. Gentleman, "R: A language for data analysis and graphics," Journal of Computational and Graphical Statistics, vol. 5, no. 3, pp. 299–314, 1996.

[40] J. D. Rodriguez, A. Perez, and J. A. Lozano, "Sensitivity analysis of k-fold cross validation in prediction error estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 3, pp. 569–575, 2010.

[41] G. C. Cawley and N. L. Talbot, "On over-fitting in model selection and subsequent selection bias in performance evaluation," Journal of Machine Learning Research, vol. 11, pp. 2079–2107, 2010.

Anup Das is an Assistant Professor at Drexel University. He received a Ph.D. in Embedded Systems from the National University of Singapore in 2014. Prior to his Ph.D., he was a research engineer for more than 7 years at ST Microelectronics (India and Grenoble) and LSI Corporation (India). Following his Ph.D., he was a post-doctoral fellow at the University of Southampton from 2014 to 2015 and a researcher at IMEC from 2015 to 2017. His research focuses on neuromorphic computing, from algorithm development to architectural exploration. His other

research interests include system-level design techniques for lifetime and energy optimization, soft-error tolerance of FPGA configuration bitstreams, synchronous data flow graph based task mapping and scheduling, probabilistic energy and performance optimization of multiprocessor systems, architectural adaptations for lifetime improvement of multiprocessors, and design-for-testability (DFT) of multi-power-domain SoCs.

Francky Catthoor received a Ph.D. in EE from the Katholieke Univ. Leuven, Belgium in 1987. Between 1987 and 2000, he headed several research domains in the area of synthesis techniques and architectural methodologies. Since 2000 he has been strongly involved in other activities at IMEC, including deep submicron technology aspects, IoT and biomedical platforms, and smart photovoltaic modules, all at IMEC Leuven, Belgium. Currently he is an IMEC fellow. He is also a part-time full professor at the EE department of the KU Leuven.

He has been associate editor for several IEEE and ACM journals. He was elected IEEE fellow in 2005.

André Bourdoux received the M.Sc. degree in electrical engineering (specialization in microelectronics) in 1982 from the Université Catholique de Louvain-la-Neuve, Belgium. He joined IMEC in 1998 and is now Principal Member of the Technical Staff in the "Perceptive Systems for the IoT" department of IMEC. He is a system-level and signal processing expert for the mm-wave and sub-10 GHz baseband teams and for the mm-wave radar team. His current research interests are multi-disciplinary, spanning the areas of wireless communications and signal processing, with a special emphasis on broadband systems, emerging physical layer techniques and high resolution radars.

He is a renowned expert in OFDM, MIMO, MU-MIMO and 60 GHz communications and high resolution radars; he holds several patents in these fields. He has been and still is involved for IMEC in several European projects. He also represents IMEC in the standardization of wireless systems, actively contributing to standards working groups such as IEEE 802.11 and 802.15 and ETSI mWT.

Before joining IMEC, his research activities were in the field of algorithms and RF architectures for coherent and high-resolution Doppler radar systems.

He is the author and co-author of over 150 publications in books and peer-reviewed journals and conferences.

Bert Gyselinckx Biography not available.