
System-Level Power Estimation Tool for Embedded Processor based Platforms

Santhosh Kumar Rethinagiri
BSC-Microsoft Research Center
[email protected]

Oscar Palomar
BSC-Microsoft Research
[email protected]

Rabie Ben Atitallah
LAMIH, Université de Valenciennes
rabie.benAtitallah@univ-valenciennes.fr

Smail Niar
LAMIH, Université de Valenciennes
[email protected]

Osman Unsal
BSC-Microsoft Research
[email protected]

Adrian Cristal Kestelman
BSC-Microsoft Research
[email protected]

ABSTRACT

Due to the ever increasing constraints on power consumption in embedded systems, this paper addresses the need for an efficient tool for power modeling and estimation at the system level. On the one hand, today's embedded industries focus more on manufacturing RISC processor-based platforms as they are cost and power effective. On the other hand, modern embedded applications are becoming more and more sophisticated and resource demanding: multimedia (H.264 encoder and decoder), software defined radio, GPS, mobile applications, etc. The main objective of this paper is to address the scarcity of fast power modeling and accurate power estimation tools at the system level for complex embedded systems. We propose a standalone simulation tool for power estimation at the system level. As a first step, we develop the power models at the functional level. This is done by characterizing the power behavior of RISC processor based platforms across a wide spectrum of application benchmarks to understand their power profile. Then, we propose power models to cost-effectively estimate their power at run-time for complex embedded applications. The proposed power models rely on a few parameters which are based on functional blocks of the processor architecture. As a second step, we propose a power estimation simulator which is based on a cycle-accurate full system simulation framework. The combination of these two steps provides a standalone power estimation tool at the system level.

The effectiveness of our proposed methodology is validated through an ARM9, an ARM Cortex-A8 and an ARM Cortex-A9 processor designed around the OMAP5912, OMAP3530 and OMAP4430 boards respectively. The efficiency and the accuracy of our proposed tool are evaluated using a variety of programs, from basic programs to complex benchmarks. Estimated power values are compared to real board measurements for the different processor architecture based platforms. The resulting power estimates show an error of less than 3% for the ARM940T processor, 2.9% for the ARM Cortex-A8 processor and 4.2% for the ARM Cortex-A9 processor based platforms, and compare favorably with other state-of-the-art power estimation tools.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
’14, January 22 2014, Vienna, Austria.
Copyright 2014 ACM 978-1-4503-2471-7/14/01 ...$15.00.
http://dx.doi.org/10.1145/2555486.2555491.

Categories and Subject Descriptors
I.6 [Simulation and modeling]: Model validation, analysis

General Terms
System Level Power Estimation

Keywords
Power/Energy Estimation, Functional power models, RISC processor, Multi-core, Speedup, Accuracy

1. INTRODUCTION

Due to the increasing complexity of applications and architectures in embedded systems, designers face a very large design space. Exploring the design space to reach an efficient solution becomes very difficult, especially when the design must satisfy a large number of constraints, such as performance, power and energy consumption. For this reason, embedded hardware designers are directed more and more towards parallel multi-core architecture based platforms as a promising solution to exploit the potential parallelism inherent in complex embedded and intensive signal processing applications. As an example of a commercialized platform, we cite the Texas Instruments OMAP4 platform, which embeds a dual-core ARM Cortex-A9 processor. Recently, the ITRS [10] and HiPEAC [14] roadmaps have promoted the views that "power defines performance" and "power is the wall". In fact, power dissipation is becoming a critical pre-design metric in complex embedded systems. Another important aspect is application execution, which drives the activities of the underlying hardware: the manner in which applications use the hardware components can have a substantial impact on the power dissipation of a system [1].

Therefore, it is becoming crucial to model power consumption from the perspective of the software. Facing this issue, designers should estimate the power consumption as early as possible in the design flow to reduce the time-to-market and the development cost. Today, system-level power estimation is considered a vital premise to cope with the critical design constraints. However, the development of tools for power estimation at the system level faces extremely challenging requirements, such as an accurate power modeling approach and a fast system-level estimation technique.

At the system level, the power estimation process is centered around two correlated aspects: the power model granularity and the system abstraction level. The first aspect concerns the granularity of the relevant activities on which the power model relies. It covers a large spectrum that starts from the fine-grain level, such as logic gate switching, and stretches out to the coarse-grain level, such as hardware component events. In general, fine-grain power estimation yields a model that is more closely correlated with data and technological parameters, which is tedious for system-level designers. Coarse-grain power models, on the other hand, depend on micro-architectural activities that cannot be determined easily. The second aspect involves the abstraction level at which the system is described. It starts from the usual Register Transfer Level (RTL) and extends up to the algorithmic level. In general, going from a low to a high design level corresponds to a more abstract description and therefore to a coarser activity granularity. The power evaluation time increases as we go down through the design flow, and the accuracy depends on the extraction of each relevant activity and on the characterization methodology used to evaluate the related power cost. In order to have an efficient standalone power estimation tool, we should find a better trade-off between these two aspects.

To answer the above challenges, we propose an efficient tool for power dissipation estimation of complex processor-based platforms. The idea is to develop a power estimation tool at the system level which combines functional-level power models for hardware power modeling with a simulation technique for rapid prototyping and fast power estimation. The functional power models are coupled with the gem5 [2] full system simulator in order to obtain the activities needed by the power models, which allows us to reach a good balance between accuracy and speed.

The rest of this paper is organized as follows. Section 2 discusses the background for this work. Section 3 presents the proposed power estimation methodology. In Section 4, the power modeling methodology is applied to three complex embedded processors designed around the OMAP5912, OMAP3530 and OMAP4430 boards. Section 5 presents experimental results that evaluate our methodology in terms of accuracy and speed.

2. RELATED WORKS

In this section, we present approaches and tools that deal with power estimation at different abstraction levels. At the circuit level, the microprocessor is represented in terms of transistors and nets, which is extremely complex. It also requires going through all the steps of the design flow, including layout, routing and parameter extraction. This is not feasible since, most of the time, processor manufacturers do not disclose detailed technology information and most of their tools tend to be in-house. The simulation of even a small set of transistors requires a large amount of time, which is not practical [7]. In an early attempt to build a low-level power consumption simulator, PowerMill [9] was introduced. This tool simulates the current and power characteristics of circuits. It is also capable of simulating detailed current behavior in modern deep sub-micron CMOS circuits, including complex circuit blocks, with speed and capacity approaching those of conventional gate-level simulators, as cited in [13].

Figure 1: System-level power estimation tool flow
[Diagram labels: STEP 1; STEP 2; functional and system-level power simulator; Cycle-Accurate (CA) and hybrid (CA+JIT) simulation technique; processor ISAs; applications; mapping; software task; processor architecture; power estimation results.]

Figure 2: Standalone simulation framework
[Diagram labels: functional level; system level; Task 1, Task 2, Task 3; data; task and data interface; activity counter interface; processor and application mapping; power estimator kernel; power models library; ARM9 (ISA, cache counters); ARM Cortex-A8 and ARM Cortex-A9 (ISA, cache and IPC counters); memory; bus; I/O; peripherals.]

To overcome the simulation speed drawback, almost all the previous research focused on the Register Transfer Level (RTL) for power estimation. At the RTL, the power model is based on empirical methods that measure the power consumption of existing implementations and produce models from those measurements. Potlapally et al. [15] present a technique for cycle-accurate power macro-modeling of RTL components. They create a power macro-model for each of the component behaviors, also known as power modes. Their framework chooses the appropriate power mode from the input trace in each cycle and then applies the power macro-modeling technique discussed by Bogliolo et al. [5] to obtain an estimate of the power numbers. The technique of [15] is limited to the typical average power estimation scenarios; it covers non-trivial scenarios as well, but the estimation speed is very slow.

The concept of power estimation at the software level was introduced by Tiwari et al. [20] through the Instruction Level Power Analysis (ILPA) approach. They associate a power consumption model with instructions or instruction pairs. The power consumed by a program running on the processor can be estimated by using an Instruction Set Simulator (ISS) to extract instruction traces, and then adding up the total cost of the instructions. This approach suffers from the high number of experiments required to obtain the model. In addition, it is applicable only to processors that feature a simple architecture.
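As a rough illustration of the instruction-level idea, the sketch below adds a base cost per instruction and an overhead per pair of consecutive instructions over a trace. The instruction names, cost tables and values are hypothetical placeholders and are not taken from [20]; a real model would be obtained by measurement.

```python
# Minimal ILPA-style sketch. All cost values below are hypothetical placeholders.
BASE_COST_NJ = {"add": 1.0, "mul": 2.5, "ldr": 3.0, "str": 2.8}

# Extra cost charged when one instruction directly follows another.
PAIR_OVERHEAD_NJ = {("add", "mul"): 0.2, ("mul", "ldr"): 0.4, ("ldr", "str"): 0.3}
DEFAULT_PAIR_OVERHEAD_NJ = 0.1


def trace_energy_nj(trace):
    """Sum base costs and pairwise overheads over an instruction trace."""
    energy = sum(BASE_COST_NJ[op] for op in trace)
    for prev_op, next_op in zip(trace, trace[1:]):
        energy += PAIR_OVERHEAD_NJ.get((prev_op, next_op), DEFAULT_PAIR_OVERHEAD_NJ)
    return energy


def average_power_mw(trace, cycles, frequency_mhz):
    """Average power is the trace energy divided by the execution time."""
    time_us = cycles / frequency_mhz          # MHz means cycles per microsecond
    return trace_energy_nj(trace) / time_us   # nJ / us = mW


if __name__ == "__main__":
    trace = ["ldr", "add", "mul", "str"]      # would come from an ISS in practice
    print(trace_energy_nj(trace))
    print(average_power_mw(trace, cycles=8, frequency_mhz=100.0))
```

The size of the two cost tables grows with the instruction set, which is one way to see why the number of characterization experiments becomes a concern.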

To overcome this drawback, Laurent et al. [11] proposed the Functional Level Power Analysis (FLPA) methodology, which was successfully applied to build high-level power models for different hardware components (processor, memory, I/O peripherals, etc.). FLPA relies on the identification of a set of functional blocks which strongly influence the power consumption of the target platform. Once the model is built, the estimation process consists of extracting the appropriate parameter values from the application, which are then injected into the model to compute the power consumption. Based on this methodology, the tool SoftExplorer [3] was developed. It includes a library of power models for simple to complex processors. Recently, SoftExplorer has been included as a part of the Consumption Analysis Toolbox (CAT) [3]. CAT gives relatively precise power estimation results in a very short time. Indeed, only a static analysis of the code or a rapid profiling is necessary to determine the input parameters for the power models. However, when complex hardware or software is involved, some parameters may be difficult to determine with precision. This is the case, for instance, for cache miss rates in complex processors. This lack of precision may have a non-negligible impact on the final estimation accuracy, depending on the sensitivity of the parameter. Furthermore, an extension of the FLPA methodology is presented in [4] to model processor cores whose power consumption depends strongly on the executed instruction. To this end, a so-called hybrid functional-level/instruction-level power analysis (FLPA/ILPA) model is elaborated, which combines the low modeling and computational effort of an FLPA model with the higher accuracy of an ILPA model. To apply this methodology, the real platform or a licensed simulator is needed. To overcome this issue, McPAT [12] was introduced, which is an improved model of the Cacti [19] tool set. McPAT supports power, area and timing estimation for multicore processors. McPAT uses its XML interface to interact with the simulator in order to collect the data needed by its power model. Here the major issue is the accuracy of the power model and of its estimation. In this paper, we compare the accuracy of the power estimates of the proposed tool with those of McPAT, which is executed with the help of the Multi2Sim [21] functional simulator for ARM.

In order to make a better trade-off between power estimation time and accuracy, several studies have proposed evaluating system power consumption at higher abstraction levels. Almost all of these tools use micro-architectural simulators to evaluate system performance, together with analytic power models to estimate the consumption of each component of the platform. SimplePower [22] is an example of an available tool. In general, these tools rely on the Cycle-Accurate (CA) simulation technique. In this work, we use the gem5 full system simulator, which supports the instruction set architectures of state-of-the-art processors such as the ARM Cortex family. Usually, to move from the RTL to the CA level, hardware implementation details are hidden from the processing part of the system, while preserving system behavior at the clock cycle level. The power consumption of the main internal units is estimated using power macro-models, produced from lower-level characterizations. The contributions of the unit activities are calculated and added together during the execution of the program on the cycle-accurate micro-architectural simulator. Although CA simulators allow accurate power estimation and simulation is fast for off-the-shelf processors, power modeling at the RTL takes more time, approximately a month or even more depending on the architecture. In order to refine the value of sensitive parameters within a reasonable delay, we propose in our work to couple the functional power modeling methodology with a substantially modified gem5 cycle-accurate simulator. Thus, a reasonable trade-off between estimation speed and accuracy is reached. A similar approach was proposed in our previous work for simple processor based platforms [18], [17], [16]. In this work, we extend our tool to the power estimation of multi-core processor based platforms.

3. POWER ESTIMATION METHODOLOGY

This section presents our proposed power estimation methodology, which is divided into two steps as shown in Fig. 1.

3.1 Functional power models

Step 1 concerns the elaboration of the power models for the system hardware components. In our framework, the FLPA methodology is extended to develop generic power models for different target platforms. The main advantage of this methodology is that it yields power models which rely on the functional parameters of the system with a reduced number of experiments. As explained in the previous section, functional power models come with a few consumption laws, which are associated with the activity values of the main functional blocks of the system. The generated power models have been adapted to the system level, as the required activities can be obtained from the simulator. For a given platform, the power models are generated only once. To do so, the processor architecture is divided into different functional blocks, and the components that are concurrently activated when the code is running are then clustered.

There are two types of parameters: algorithmic parameters, which depend on the executed algorithm, such as the cache miss rate or the instructions per cycle rate, and architectural parameters, which depend on the component configuration set by the designer, such as the clock frequency. For instance, Table 1 presents the common set of parameters of our generic power model. This set of parameters is defined for a general class of embedded processors. Additional parameters can be identified for specific processor microarchitectures such as superscalar.

Table 1: Generic power model parameters

Type           Name         Description
Algorithmic    τ            External memory access rate
Algorithmic    γ            Cache miss rate for a processor
Algorithmic    IPC          Instruction Per Cycle
Architectural  Fprocessor   Frequency of the processor
Architectural  Fbus         Frequency of the bus
Architectural  N            Number of cores

Figure 3: Measurement environment for OMAP4430, OMAP3530 and OMAP5912

Figure 4: Jumpers for OMAP3530

Figure 5: Power measurement probes across the jumpers for OMAP3530

The next step is the characterization of the embedded system power consumption when the parameters vary. These variations are obtained by using some elementary assembly programs (called scenarios) or built-in test vectors elaborated to stimulate each block separately. In our work, characterization is performed by measurements on real boards. Finally, a curve fitting of the graphical representation allows us to determine the power consumption models by regression. The obtained power models are expressed in analytical form or as a table of values. This power modeling approach was proven to be fast and precise.
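The regression step can be illustrated with an ordinary least-squares fit of a straight line to (parameter value, measured power) pairs. This is only a sketch under the assumption of a single varying parameter and a linear law; the function names are ours, the sample readings are invented for illustration, and the paper does not state which fitting procedure or tool was used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical characterization run: one scenario swept over the
    # processor frequency (MHz) with invented power readings (mW).
    frequencies_mhz = [100.0, 200.0, 300.0, 400.0]
    measured_mw = [83.0, 152.0, 224.0, 291.0]
    slope, intercept = fit_line(frequencies_mhz, measured_mw)
    print(slope, intercept, r_squared(frequencies_mhz, measured_mw, slope, intercept))
```

With several parameters varying at once (frequency, IPC, miss rates), the same idea extends to a multiple linear regression.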

3.2 Simulation framework

Step 2 of the methodology defines the architecture of our power estimator, which includes the functional power estimator and the fast gem5 simulator, as shown in Fig. 2. The functional power estimator evaluates the consumption of the target system with the help of the power models elaborated in the first step. It takes into account the architectural parameters (e.g. the frequency, the processor cache configuration, Instruction Per Cycle, etc.) and the application mapping. It also requires the different activity values on which the power models rely. In order to collect the needed activity values accurately, the functional power estimator communicates with the gem5 simulator at the system level. Here we would like to emphasize that the gem5 simulator we use has two modes. The first mode executes the simulator at the cycle-accurate level using the gem5 ARM ISA in order to capture the detailed activity at run-time. In the second mode, we propose to combine the Just-In-Time (JIT) Dynamic Binary Translation (DBT) technique with our cycle-accurate processor model (ISA) in order to speed up simulation. The combination of these two techniques leads to a hybrid simulation technique, which is applicable only to single-core processor models. The proposed hybrid simulation technique is similar to the one proposed by I. Borm et al. [6]. The combination of the above two steps, described at different abstraction levels (functional and CA), leads to our proposed standalone power estimator at the system level, which gives a better trade-off between accuracy and speed.

The vital function of this power estimation methodology is to offer a detailed power analysis by means of a complete simulation of the application. This process is initiated by the functional power estimator through the Application and Processor Interface (Fig. 2). In this way, the mapping information is transmitted to the fast gem5 simulator. Our simulator consists of several hardware components which are instantiated from the gem5 [8] library in order to build the target system simulator. In the power estimation step, the simulator collects the activities that are influenced by the application and the input data. At the end of the simulation, the values of the activities are transmitted to the power consumption models, or power estimator kernel, through the activity counter interface in order to calculate the global power consumption, as illustrated in Fig. 2. The following section discusses the first step in detail, namely the elaboration of the power models for the OMAP5912, OMAP3530 and OMAP4430 platforms by identifying the functional parameters related to the power consumption.
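The hand-off from the simulator to the power estimator kernel can be pictured as follows: raw activity counters are turned into model parameters (IPC, cache miss rate) and passed to a power model. This is a minimal sketch of that flow only; the class, field and function names are hypothetical and do not correspond to gem5 statistics names or to the tool's actual interface, and the model coefficients are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ActivityCounters:
    """Raw event counts read at the end of a simulation (hypothetical names)."""
    instructions: int
    cycles: int
    l1_accesses: int
    l1_misses: int


def instructions_per_cycle(counters):
    return counters.instructions / counters.cycles


def miss_rate_percent(misses, accesses):
    """Cache miss rate expressed as a percentage."""
    return 100.0 * misses / accesses if accesses else 0.0


def estimate_power_mw(counters, f_processor_mhz, power_model):
    """Turn counters into model parameters and evaluate the power model."""
    ipc = instructions_per_cycle(counters)
    gamma1 = miss_rate_percent(counters.l1_misses, counters.l1_accesses)
    return power_model(f_processor_mhz, ipc, gamma1)


def placeholder_model(f_processor_mhz, ipc, gamma1):
    # Placeholder coefficients for illustration only, not a fitted model.
    return 0.5 * f_processor_mhz + 10.0 * ipc + 1.0 * gamma1


if __name__ == "__main__":
    counters = ActivityCounters(instructions=1500, cycles=1000,
                                l1_accesses=400, l1_misses=20)
    print(estimate_power_mw(counters, 200.0, placeholder_model))
```

Passing the model as a function keeps the counter interface independent of any particular processor, which mirrors the separation between the simulator and the power models library in Fig. 2.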

4. POWER MODEL GENERATION

In this section, we show our power measurement environment and then we focus on the elaboration of the power models based on the functional parameters affecting the power consumption for different embedded processor based platforms.

4.1 Power measurement environment

Fig. 3 shows the measurement environment for the OMAP3530, OMAP4430 and OMAP5912 platforms, which is composed of a power measuring instrument (Agilent LXI digitizer) in a dedicated private network. The digitizer accurately measures the static and dynamic current consumption across the resistors in place. The Agilent Technologies L4532A¹ is a high-resolution, standalone LXI digitizer. It offers 2 channels of simultaneous sampling at up to 20 MSa/s, with 16 bits of resolution. Inputs are isolated and can measure up to ±250 V to handle the most demanding applications.

Figure 6: Functional parameters of ARM Cortex-A9 processor
[Diagram labels, shown once per core: fetch/decode unit (instruction fetch, instruction decode, instruction execute, load/store); processing unit (IPC_Core1, IPC_Core2); L1 cache memory with instruction memory (32 kB) and data memory (32 kB), read access rate, write access rate, L1 instruction miss rate and L1 data miss rate (γ1_Core1, γ1_Core2). Other labels: L2 unified instruction/data cache (labelled 256 KB and 1 MB); L2 instruction miss rate and L2 data miss rate (γ2); memory; core clock system, frequency (F).]

Fig. 4 and Fig. 5 show a simple way to take quick power measurements on the OMAP3530 Evaluation Module (EVM)² by using a multimeter to measure the voltage drop across the jumpers J5, J6, and J9. This provides an instantaneous power measurement and is a good representation of the power consumed in a scenario where the power profile is relatively flat. For scenarios where the power changes drastically, a multimeter might not give the full power picture. In such cases, a more sophisticated tool is needed that can obtain and record several voltage readings over time. For tools with their own built-in current measuring shunt resistors, the resistors on the EVM can be removed.

The EVM has three separate power rails, J5 for the processor rail, J6 for the interconnects and J9 for the peripherals rail, in addition to the EVM 1.8 supply rail. Each rail can be measured by measuring the voltage on the specific jumper assigned to it. Fig. 4 gives an idea of the location of the jumpers for each rail on the board.
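The jumper measurement reduces to Ohm's law: the voltage drop across the shunt gives the rail current, and the current multiplied by the rail voltage gives the power. The sketch below shows this arithmetic for a single reading and for a series of readings recorded over time. The rail voltage, voltage drops and shunt resistance used here are hypothetical values chosen for illustration; the actual values must be taken from the board documentation.

```python
def rail_power_mw(rail_voltage_v, drop_mv, shunt_ohm):
    """Power drawn on one rail from the voltage drop across its shunt."""
    current_ma = drop_mv / shunt_ohm      # Ohm's law: mV / ohm = mA
    return rail_voltage_v * current_ma    # V * mA = mW


def average_power_mw(rail_voltage_v, drops_mv, shunt_ohm):
    """Mean power over a series of voltage-drop readings recorded over time."""
    powers = [rail_power_mw(rail_voltage_v, d, shunt_ohm) for d in drops_mv]
    return sum(powers) / len(powers)


if __name__ == "__main__":
    # Hypothetical rail: 1.2 V supply measured through a 0.1 ohm shunt.
    print(rail_power_mw(1.2, 10.0, 0.1))
    print(average_power_mw(1.2, [10.0, 20.0, 15.0], 0.1))
```

A single multimeter reading corresponds to `rail_power_mw`; a recording instrument such as a digitizer provides the series of readings needed when the power profile is not flat.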

4.2 Power model development

In order to prove the usefulness and the effectiveness of the proposed power estimation methodology, we used an ARM9³ architecture implemented in the OMAP5912⁴, an ARM Cortex-A8⁵ based architecture implemented in the OMAP3530 and an ARM Cortex-A9 implemented in the OMAP4430⁶ platform. The OMAP5912 contains an ARM926EJ-S processor (16 kB instruction cache and 8 kB data cache). The OMAP3530 contains an ARM Cortex-A8 processor (16 kB, 2-way set associative instruction and data caches and a 256 kB L2 cache). The OMAP4430 contains a dual-core ARM Cortex-A9 processor (32 kB, 4-way set associative instruction and data caches for each core and a common 1 MB L2 cache).

¹ http://www.agilent.com/L4534A/20-msa-s-4-channel-lxi-digitizer
² http://www.ti.com/tool/TMDSEVM3530
³ http://infocenter.arm.com/arm.doc.arm9
⁴ http://www.ti.com/product/omap5912
⁵ http://www.arm.com/cortex-a/cortex-a8.php

As explained above, we used the different functional parameters to generate generic power models for the target system. As a first step, we divided the architecture into different functional blocks such as the processor, the memory system, the pipeline stage unit, etc., as shown in Fig. 6 for an ARM Cortex-A9 processor. The second step is the characterization of the power model by varying the parameters. These variations are obtained by using some elementary assembly programs (called scenarios) or built-in test vectors elaborated to stimulate each block separately. Characterization can be performed by measurements on real boards. Finally, a curve fitting of the graphical representation allows us to determine the power consumption laws by regression. The obtained power laws are expressed in analytical form or as a table of values. This power modeling approach was proven to be fast and precise. In our work, this approach has been applied to model the power consumption of the processor, memory and functional units.

Figure 7: Power consumption cost according to the Instruction Per Cycle (IPC) for ARM Cortex-A9 processor
[Plot: x-axis, assembly benchmark application; axes, power (mW), 0 to 600, and Instruction Per Cycle (IPC), 0 to 2.5; series, cost in power (mW), instruction per cycle, and a linear fit of the cost in power with y = 0.1336x + 0.2803, R² = 0.9725.]

Processor power model: Table 3 shows the power consumption models for the ARM9, ARM Cortex-A8, and ARM Cortex-A9. The input parameters on which the power models rely are the frequency of the processor (Fprocessor, in MHz), the Instruction Per Cycle (IPC), and the cache miss rate (0 < γ < 100 (%)). The system designer chooses the frequency of the processor and of the bus, while the cache miss rate and the IPC are considered as activities of the processor, which can be extracted from the simulation environment. In Table 3, there is an extra parameter for the L2 cache, i.e., γ2. The constants b and c change depending on the number of pipeline units and the cache configurations.

⁶ http://www.ti.com/product/omap4430

Table 2: JPEG decoder application simulation results of L1 cache miss rates for ARM Cortex-A9 single core

Program                                   Instruction miss rate  Read miss rate  Write miss rate  Total miss rate
Variable Length Decoding (VLD)            0.003386               3.56            31.73            0.02
Zigzag scan (ZZ)                          0.001128               3.03            99.91            5.64
De-quantization (DQ)                      0.002283               4.49            40.72            3.88
Inverse Discrete Cosine Transform (IDCT)  0.000812               2.06            99.88            5.58
Color Conversion                          0.004375               4.58            20.11            0.85
Reordering                                0.298380               3.05            25.19            2.87

Figure 8: Power estimation accuracy for the JPEG decoder application (ARM Cortex-A9 at 1 GHz)
[Plot: x-axis, JPEG decoder task (VLD, ZZ, DQ, IDCT, Color Conversion, Reordering); axes, power (mW), 1200 to 1500, and error (%), 0 to 4.5; series, power measured (mW), power estimated (mW) and error (%).]

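Table 3: Generic power models for different processors

Processor                     Power model
ARM9                          $P\,(\mathrm{mW}) = 1.03\,F_{processor} + 0.6\,\gamma + 5.3$
ARM Cortex-A8                 $P\,(\mathrm{mW}) = 0.79\,F_{processor} + 18.65\,\mathrm{IPC} + 0.26\,(\gamma_1 + \gamma_2) + 10.13$
ARM Cortex-A9 (single core)   $P\,(\mathrm{mW}) = 0.7\,F_{processor} + 236.54\,\mathrm{IPC} + 0.67\,\gamma_1 + 1.4\,\gamma_2 + 12.45$
ARM Cortex-A9 (dual core)     $P\,(\mathrm{mW}) = 0.7\,F_{processor} + b \sum_{i=1}^{2} \mathrm{IPC}_{c_i} + 1.4\,\gamma_2 + c \sum_{i=1}^{2} \gamma_{1,c_i} + 12.45$

In the dual-core model, $c_1$ and $c_2$ denote the two cores (IPC_Core1, IPC_Core2 and γ1_Core1, γ1_Core2 in Fig. 6).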
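The models of Table 3 can be evaluated directly once the parameters are known. The sketch below transcribes the coefficients from the table into plain functions; the function names and the sample inputs are ours and purely illustrative. For the dual-core model, the constants b and c are not given numerically in the text, so they are left as arguments that the caller must supply.

```python
def arm9_power_mw(f_mhz, gamma):
    """ARM9 model: processor frequency (MHz) and cache miss rate (%)."""
    return 1.03 * f_mhz + 0.6 * gamma + 5.3


def cortex_a8_power_mw(f_mhz, ipc, gamma1, gamma2):
    """ARM Cortex-A8 model with L1 (gamma1) and L2 (gamma2) miss rates (%)."""
    return 0.79 * f_mhz + 18.65 * ipc + 0.26 * (gamma1 + gamma2) + 10.13


def cortex_a9_single_power_mw(f_mhz, ipc, gamma1, gamma2):
    """ARM Cortex-A9 single-core model."""
    return 0.7 * f_mhz + 236.54 * ipc + 0.67 * gamma1 + 1.4 * gamma2 + 12.45


def cortex_a9_dual_power_mw(f_mhz, ipcs, gamma1s, gamma2, b, c):
    """ARM Cortex-A9 dual-core model; b and c must be supplied by the caller."""
    return (0.7 * f_mhz + b * sum(ipcs) + c * sum(gamma1s)
            + 1.4 * gamma2 + 12.45)


if __name__ == "__main__":
    # Illustrative inputs only, not measured activity values.
    print(arm9_power_mw(120.0, 5.0))
    print(cortex_a8_power_mw(500.0, 1.0, 2.0, 3.0))
    print(cortex_a9_single_power_mw(1000.0, 2.0, 1.0, 1.0))
```

Frequencies are in MHz and miss rates in percent, as stated for the input parameters of the models.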

Fig. 7 shows the variation of the processor power consumption when different assembly benchmarks are applied. From Fig. 7, we can see that the processor power variation spans a range from 55 mW to 490 mW depending on the executed assembly code, and that it accounts for 20% to 40% of the total power of the processor. From this, we identify the IPC (Instructions per Cycle) as an important metric to characterize the power of modern processors. The explanation for this correlation lies in the fact that, in a complex superscalar processor, a dominant portion of the power is consumed by the logic used to exploit instruction level parallelism.

5. SYSTEM-LEVEL POWER ESTIMATION RESULTS

For the second step of our power estimation tool, a cycle-accurate prototype of an ARM Cortex-A9, an ARM Cortex-A8 and an ARM9 based architecture has been developed. This prototype uses the different component models and the cache model provided with gem5 for the cache miss rate, and the ARM ISA for the target processor. Furthermore, the cache parameters and the bus latencies are set to emulate the real platform behaviour. From the available cache model and pipeline model, we are able to determine the occurrences of the main activities. For the ARM Cortex-A9 processor, the following counters are used: read data miss, write data miss and read instruction miss for the different cache miss rates, and Instruction Per Cycle (IPC). We used the JPEG decoder as the main benchmark for the ARM Cortex-A9 processor. The JPEG decoder application consists of six main tasks: Variable Length Decoding (VLD), Zigzag scan (ZZ), De-quantization (DQ), Inverse Discrete Cosine Transform (IDCT), Color Conversion and Reordering.

Figure 9: Comparison of power estimation accuracy of our proposed tool (ARM940T at 120 MHz) vs Real Board Measurement
[Plot: x-axis, benchmark applications (basicmath, bitcount, qsort, susan, JPEG, MP3 Encoder, M-JPEG2000, X264 encoder, MPEG audio decoder, JPEG2000 decoder); axes, power (mW), 115 to 145, and error (%), 0 to 1.8; series, power estimation (mW), power measurement (mW) and error (%).]

Table 2 shows the detailed results of the activities collected by the fast cycle-accurate simulator for each task of the JPEG application on an ARM Cortex-A9 processor. Several remarks can be drawn from these results. First, we can notice that instruction cache miss rates and read data miss rates are very low compared to write data miss rates. This is due to the small task kernel and data pattern sizes, which are very low compared to the cache size (32 kB); this decreases the number of accesses to the external memory and thus has a minimal effect on the dynamic power consumption. Second, the data write miss rates have a high impact on the total power consumption of the system. This is because of the algorithm structure, which does not favour the reuse of data output arrays or make good use of the cache policy. Therefore, the statistics collected in Table 2 could help in tuning the application structure for a better optimization of the system power consumption. In a similar fashion, we extracted the activities for the ARM Cortex-A8 and ARM9 processors.

5.1 Estimation accuracy

In the next step, we estimated the total power consumption of each task using the power models shown in Table 3 for an ARM Cortex-A8 single core processor. Fig. 8 illustrates the results and shows the comparison between the proposed tool and the real board measurements. First, our power estimator has a low maximum error of around 4.6%. This study offers a detailed power analysis for each task in order to help designers detect consumption peaks and thus propose efficient mapping or optimization techniques. In order to evaluate the accuracy of our tool, we carried out power estimation on several image and signal processing benchmarks. Fig. 9 and Fig. 10 illustrate the power estimation results by showing the comparison between the proposed power estimation tool and the real board measurements. Our proposed tool has a low average error of 1.24% and 2.4% for the ARM9 and ARM Cortex-A8 single core processor based platforms respectively.
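The error figures quoted here can be understood as relative errors of the estimate with respect to the board measurement. The paper does not spell out the formula, so the sketch below assumes the usual definition, |estimated − measured| / measured × 100, and reports the average and the maximum over a set of benchmarks. The function names and the sample values are invented for illustration and are not the paper's data.

```python
def relative_error_percent(estimated_mw, measured_mw):
    """Relative error of an estimate with respect to the measurement (assumed definition)."""
    return abs(estimated_mw - measured_mw) / measured_mw * 100.0


def summarize_errors(pairs):
    """Return (average error, maximum error) over (estimated, measured) pairs."""
    errors = [relative_error_percent(est, meas) for est, meas in pairs]
    return sum(errors) / len(errors), max(errors)


if __name__ == "__main__":
    # Hypothetical (estimated, measured) power values in mW.
    results = [(102.0, 100.0), (96.0, 100.0), (201.0, 200.0)]
    average, maximum = summarize_errors(results)
    print(average, maximum)
```

Reporting both values is useful because a low average can hide a task with a noticeably larger error.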

Figure 10: Comparison of power estimation accuracy of our proposed tool (ARM Cortex-A8 at 500 MHz) vs Real Board Measurements
[Plot: x-axis, benchmark applications (basicmath, bitcount, qsort, susan, JPEG, MP3 Encoder, M-JPEG2000, X264 encoder, MPEG audio decoder, JPEG2000 decoder); axes, power (mW), 270 to 350, and error (%), 0 to 4.5; series, power estimation (mW), power measurement (mW) and error (%).]

To estimate power for a dual-core ARM Cortex-A9 processor, we need to select a proper task partitioning for the application under experimentation. We studied the application carefully to match the JPEG decoder to the dual-core processor platform. The compressed image data is fed to the VLD in the JPEG decoder, so the VLD must be executed by the first core of the processor. To divide the ZZ, DQ, IDCT and color conversion tasks over the two cores, we examined the data consumption and production rates of the various parts of the system. The VLD consumes data from the outside world and produces data in blocks. The zigzag scan, DQ and IDCT tasks also consume and produce one block at a time. The color conversion and re-ordering tasks require one or more (up to 10) blocks before they can run. The color conversion, however, produces data on a block-by-block basis and sends it to the re-ordering unit, which then produces the output data. This implies that the communication over connection 2 of our two-processor system is always in blocks. Thus every division of the JPEG decoder over two cores requires the same data rate, and the subdivision of the JPEG decoder does not influence the communication load of the system. This choice enables an almost 50-50% load sharing between the two cores. It also has the advantage that the Huffman decoding and de-quantization tables required by the VLD and DQ units, respectively, do not need to be shared by both processors.

In a similar way, we executed several other applications; the results are illustrated in Fig. 11, which compares the proposed power estimation tool, the McPAT tool and the real board measurements. Our proposed tool has an average error of 3.4% on a dual-core ARM Cortex-A9 processor-based platform, whereas McPAT has an average error of 23%. We conclude that the proposed tool offers better accuracy than this widely used state-of-the-art tool.

[Figure 11 plot: power measurement (mW), system-level power estimation (mW), McPAT power estimation (mW), system-level power estimation error (%) and McPAT error (%) for the benchmarks nsines, rdf, rffr, tdpf, tdhpf, tdlpf, M-JPEG2000, X264 encoder, Downscalar and JPEG2000 decoder. Axes: Power (mW), 0 to 2500; Error (%), 0 to 35.]

Figure 11: Comparison of the power estimation accuracy of our proposed tool (dual-core ARM Cortex-A9 at 1 GHz) vs. real board measurements
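The partitioning reasoning above can be sketched in a few lines of Python. Since every split point carries the same block-based communication load, the sketch only balances computation: it keeps the VLD on the first core and picks the contiguous split of the decoder pipeline with the smallest load gap. The task order follows the text, but the relative costs and the function name are hypothetical; the paper does not list per-task costs.

```python
# Decoder stages in pipeline order (CC = color conversion).
PIPELINE = ["VLD", "ZZ", "DQ", "IDCT", "CC", "REORDER"]


def best_split(costs: dict[str, float]) -> tuple[list[str], list[str]]:
    """Pick the contiguous two-way split with the smallest load imbalance.

    The split index starts at 1 so that VLD always stays on the first core.
    """
    total = sum(costs[task] for task in PIPELINE)
    best_index, best_gap = 1, float("inf")
    core0_load = 0.0
    for i in range(1, len(PIPELINE)):
        core0_load += costs[PIPELINE[i - 1]]
        gap = abs(core0_load - (total - core0_load))
        if gap < best_gap:
            best_index, best_gap = i, gap
    return PIPELINE[:best_index], PIPELINE[best_index:]


# Hypothetical relative per-block costs; the paper does not provide these.
costs = {"VLD": 30, "ZZ": 5, "DQ": 10, "IDCT": 35, "CC": 15, "REORDER": 5}
core0, core1 = best_split(costs)
print("core 0:", core0)
print("core 1:", core1)
```

With these assumed costs the split places VLD, ZZ and DQ on one core, so the Huffman and de-quantization tables stay local to that core, which matches the advantage noted in the text.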

6. CONCLUSION

This paper presents an efficient system-level power estimation tool for ARM processor-based OMAP platforms. Power and energy constraints are a major challenge when the system runs on batteries, so designers must take them into account as early as possible in the design flow. First, a power modeling methodology has been defined to address the global system consumption, which includes processors, memory, etc. Second, the functional power models are coupled with a fast virtual platform to obtain the micro-architectural activities they need, which allows us to reach accurate estimates. Our proposed system-level power estimation tool combines these two aspects and offers accurate and fast system-level power estimation. Experimental results show that our tool exhibits less than 4% average error with respect to real measurements, an improvement over other state-of-the-art simulation-based power estimation tools. Future work will focus on thermal aspects and on dynamic load balancing based on energy and power at the system level for more complex heterogeneous platforms. Furthermore, some power model refinements are needed to obtain more accurate power estimations.

Acknowledgments

The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] under the ParaDIME Project (www.paradime-project.eu), grant agreement no 318693.

7. REFERENCES

[1] F. Bellosa. The benefits of event-driven energy accounting in power-sensitive systems. In Proceedings of the 9th workshop on ACM SIGOPS European workshop: beyond the PC: new challenges for the operating system, EW 9, pages 37–42, New York, NY, USA, 2000. ACM.

[2] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, Aug. 2011.

[3] D. Blouin and E. Senn. CAT: An extensible system-level power consumption analysis toolbox for model-driven design. In NEWCAS Conference (NEWCAS), 2010 8th IEEE International, pages 33–36, June 2010.

[4] H. Blume, D. Becker, L. Rotenberg, M. Botteck, J. Brakensiek, and T. G. Noll. Hybrid functional- and instruction-level power modeling for embedded and heterogeneous processor architectures. J. Syst. Archit., 53(10):689–702, Oct. 2007.

[5] A. Bogliolo, L. Benini, and G. D. Micheli. Regression-based RTL power modeling. ACM Transactions on Design Automation of Electronic Systems, 5:2000, 2000.

[6] I. Bohm, B. Franke, and N. Topham. Cycle-accurate performance modelling in an ultra-fast just-in-time dynamic binary translation instruction set simulator. In Embedded Computer Systems (SAMOS), 2010 International Conference on, pages 1–10, July 2010.

[7] C. Brandolese. A Codesign Approach to Software Power Estimation for Embedded Systems. PhD thesis, Politecnico di Milano, Institute of Electronics and Information, 2000.

[8] A. Butko, R. Garibotti, L. Ost, and G. Sassatelli. Accuracy evaluation of gem5 simulator system. In Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2012 7th International Workshop on, pages 1–7, July 2012.

[9] C. X. Huang, B. Zhang, A.-C. Deng, and B. Swirski. The design and implementation of PowerMill. In M. Pedram, R. W. Brodersen, and K. Keutzer, editors, Proceedings of the 1995 International Symposium on Low Power Design 1995, Dana Point, California, USA, April 23-26, 1995, pages 105–110. ACM, 1995.

[10] K. Jeong and A. B. Kahng. A power-constrained MPU roadmap for the International Technology Roadmap for Semiconductors (ITRS), 2010.

[11] J. Laurent, N. Julien, E. Senn, and E. Martin. Functional level power analysis: an efficient approach for modeling the power consumption of complex processors. In Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, volume 1, pages 666–667 Vol. 1, Feb. 2004.

[12] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 469–480, New York, NY, USA, 2009. ACM.

[13] F. Najm. A survey of power estimation techniques in VLSI circuits. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 2(4):446–455, Dec. 1994.

[14] Y. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, and X. Martorell. High Performance Embedded Architectures and Compilers: HiPEAC 2010, Pisa, Italy. Lecture Notes in Computer Science / Theoretical Computer Science and General Issues. Springer, 2010.

[15] N. Potlapally, A. Raghunathan, G. Lakshminarayana, M. Hsiao, and S. Chakradhar. Accurate power macro-modeling techniques for complex RTL circuits. In VLSI Design, 2001. Fourteenth International Conference on, pages 235–241, 2001.

[16] S. Rethinagiri, R. Atitallah, and J. Dekeyser. A system level power consumption estimation for MPSoC. In System on Chip (SoC), 2011 International Symposium on, pages 56–61, 2011.

[17] S. Rethinagiri, R. Ben Atitallah, S. Niar, E. Senn, and J. Dekeyser. Fast and accurate hybrid power estimation methodology for embedded systems. In Design and Architectures for Signal and Image Processing (DASIP), 2011 Conference on, pages 1–7, 2011.

[18] S. K. Rethinagiri, R. Ben Atitallah, J.-L. Dekeyser, E. Senn, and S. Niar. An efficient power estimation methodology for complex RISC processor-based platforms. In Proceedings of the Great Lakes Symposium on VLSI, GLSVLSI ’12, pages 239–244, New York, NY, USA, 2012. ACM.

[19] S. Thoziyoor and N. Muralimanohar. CACTI 5.0, 2007.

[20] V. Tiwari, S. Malik, A. Wolfe, and M. T.-C. Lee. Instruction level power analysis and optimization of software. In Proceedings of the 9th International Conference on VLSI Design: VLSI in Mobile Communication, VLSID ’96, pages 326–, Washington, DC, USA, 1996. IEEE Computer Society.

[21] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli. Multi2Sim: a simulation framework for CPU-GPU computing. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques, PACT ’12, pages 335–344, New York, NY, USA, 2012. ACM.

[22] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, and W. Ye. Energy-driven integrated hardware-software optimizations using SimplePower. In Proceedings of the 27th annual international symposium on Computer architecture, ISCA ’00, pages 95–106, New York, NY, USA, 2000. ACM.
