icm04


  • 8/8/2019 icm04


    Power Consumption Awareness in Cache Memory Design with SystemC

    Smail NIAR*, Samy MEFTALI, Jean-Luc DEKEYSER

    INRIA-FUTURS, DART Project, University of Lille, France

    [niar, meftali, dekeyser]@lifl.fr

    Abstract

    This study presents the development of a cache memory module in a component library designed for fast and synthetic embedded system simulation. The paper also demonstrates that an existing analytical power consumption model can be integrated into a SystemC description at the cycle-accurate register-transfer level (RTL).

    Keywords: SystemC, power consumption, processor, cache.

    1. Introduction

    Energy consumption issues play an increasingly important role in the design of new digital electronic systems [1]. This change in designer attitude is primarily motivated by a desire to increase battery autonomy in embedded and mobile systems, to take into account the thermal issues affecting the cooling, packaging and reliability of embedded and high-performance systems, and finally to manage the environmental impact of mobile computer systems.

    In addition, over the last few years, important progress has been made in the field of integrated circuit technology. High-performance, low-cost embedded systems have been designed using a system-on-chip (SoC) approach [2]. A side effect of this development is that SoCs have become more and more complex, requiring high-level tools (for simulation, performance estimation and synthesis) during the design phase. The SystemC language is a C++ library whose aim is to facilitate complex system design by supporting hardware and system-level modelling [3]. However, although many research projects over the last few years have been devoted to improving and facilitating simulation with SystemC, very little attention has been paid to power consumption evaluation in SoC design using this language [4][5].

    To remedy this lack, we have designed a SystemC module library that serves as a framework for a new design methodology dedicated to embedded systems. These modules allow easy estimation of performance (execution time) and energy consumption. With our design methodology, SoC descriptions are synthetic, modular and accurate. These three characteristics are very important in that:

    - Synthetic descriptions permit better functional and structural understanding of the SoC. They make it possible to have several abstraction levels in the same project, thus offering a compromise between precision and speed during simulation.
    - Modular descriptions make it possible to reuse existing modules to design new SoCs, with the only cost being the separation of the SoC's "implementation" and "functional" aspects.
    - Accurate and detailed descriptions both guarantee that the performance measured by simulation is equivalent to that of the final SoC hardware and prevent any ambiguity in the ultimate implementation phase.

    This paper provides a detailed description of one of the library modules, the cache memory module, at the cycle-accurate RTL. Briefly, this cache module has the following features:

    - Modular power consumption evaluation, based on an analytical model.
    - Modular SystemC-based specifications, for module reuse.
    - Cache memory configuration exploration, for determining the best cache configuration for each application of the SoC.

    *Also at the University of Valenciennes, France

    2. Function and importance of cache memory for an embedded system

    New applications such as multimedia, image processing, telecommunication and network applications are memory-centric and must process more and more data in less and less time. For this reason, a growing number of embedded hardware platforms integrate ever larger caches. For instance, the new Intel XScale embedded processor core has two 32-way set-associative 32 KB caches (one for instructions and one for data). The size of the instruction and data caches in the new MIPS32 architecture can range from 256 bytes to 4 Mbytes. This tendency will most likely continue in future embedded processors because of the new applications' needs in terms of memory bandwidth. One consequence of this trend is that the number of transistors and the area devoted to caches have also increased. In some embedded hardware platforms (such as the Intel StrongARM), the cache memory can occupy up to 50% of the total core area and account for up to 50% of the total power consumption of the microprocessor system. In addition to their impact on performance, most cache structures are independent of the processor architecture and instruction set. Given this context, we chose to present an example describing an on-chip cache.

    To evaluate the access time and the power consumption of the cache module in our library, we used an existing analytical cache model, namely Cacti. This model is an integrated access time, power consumption and chip area model for on-chip cache memories, and it supports multibanked caches. In the Cacti model, each bank is composed of several units: arrays for storing tags and data, tag comparators, and multiplexers for selecting a word (typically 8 bytes) out of a cache line of B bytes. In this paper, only one bank is considered.

    In order to evaluate the access time, the per-access energy consumption and the chip area using Cacti, the user must specify the following parameters:

    - S: total size in bytes
    - B: block size in bytes
    - Assoc: associativity
    - T: technology size (0.1 µm by default)
    - Pread: the number of input ports
    - Pwrite: the number of output ports
    - Pread_write: the number of input-output ports

    In this study, Pread=Pwrite=0 and Pread_write=1. Using these parameters, Cacti determines the best layout (or configuration) that will optimize both the access time and the energy consumption. More details about the power consumption model used by Cacti are given in [6,7].
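As an illustration of how the parameters S, B and Assoc determine the cache geometry, the following minimal C++ sketch derives the number of sets and splits an address into tag, index and offset fields. The names CacheParams, numSets and splitAddress are hypothetical and do not belong to Cacti or to the library described here; the sketch also assumes that the block size and the number of sets are powers of two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the Cacti input parameters listed above.
struct CacheParams {
    std::uint32_t sizeBytes;   // S: total size in bytes
    std::uint32_t blockBytes;  // B: block size in bytes
    std::uint32_t assoc;       // Assoc: associativity
    double techUm;             // T: technology size in micrometres
    unsigned readPorts;        // Pread
    unsigned writePorts;       // Pwrite
    unsigned readWritePorts;   // Pread_write
};

// Number of sets = S / (B * Assoc).
inline std::uint32_t numSets(const CacheParams& p) {
    assert(p.blockBytes > 0 && p.assoc > 0);
    assert(p.sizeBytes % (p.blockBytes * p.assoc) == 0);
    return p.sizeBytes / (p.blockBytes * p.assoc);
}

// Base-2 logarithm of a value that must be a power of two (assumption).
inline unsigned log2Exact(std::uint32_t v) {
    assert(v != 0 && (v & (v - 1)) == 0);
    unsigned n = 0;
    while (v > 1) {
        v >>= 1;
        ++n;
    }
    return n;
}

struct AddressFields {
    std::uint32_t tag;
    std::uint32_t index;
    std::uint32_t offset;
};

// Splits a 32-bit byte address into tag / set index / block offset.
inline AddressFields splitAddress(const CacheParams& p, std::uint32_t addr) {
    const std::uint32_t sets = numSets(p);
    const unsigned offsetBits = log2Exact(p.blockBytes);
    const unsigned indexBits = log2Exact(sets);
    AddressFields f;
    f.offset = addr & (p.blockBytes - 1);
    f.index = (addr >> offsetBits) & (sets - 1);
    f.tag = addr >> (offsetBits + indexBits);
    return f;
}
```

With S = 8192, B = 32 and Assoc = 2, numSets returns 128, which agrees with the "Number of sets" line of the statistics report in Figure 4.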

    3. Cache memory with SystemC

    The SystemC library is object-oriented and allows a clear separation between the structure and the behaviour of architectural components. It also permits hierarchical designs (hierarchical sc_module). SystemC also offers several design possibilities at several abstraction levels: it contains both high-level and low-level data types. The latter allow bit-accurate, cycle-accurate specifications that give accurate performance estimations. For all these reasons, we decided to specify our library using SystemC.

    Figure 1 shows the position of our module as a level 1 (L1) cache. The figure also shows the cache's communication interfaces with the processor as well as with the next memory level, which may be either the second cache level or the main memory. The same protocol is used for processor-cache and cache-nextLevelMemory transfers. It is an asynchronous protocol that uses three control signals (request, write, and ack) and two buses (address and data). Transfers are initiated either by the processor (when executing a memory instruction, i.e. a load or a store) or by the L1 cache (when a cache miss occurs).

    Figure 1. The cache module as a level 1 (L1) cache and its transfer protocols.

    When the processor decodes a memory instruction, the request signal is asserted. If the referenced block is present, the Ack signal from the L1 cache to the processor is asserted, and the operation, either a read (write=0) or a write (write=1), is performed in the cache. Otherwise, the block is first transferred from the next memory level, and only then is the Ack signal asserted. More details about the data transfer protocols are presented in Figure 2, which illustrates three data transfers in a system. Due to space limitations, the memory access latency is fixed to zero (Lat=0) in this example. The cache-to-memory bus width is twice the cache block size (Blocsize=2). In the first transaction, there is a miss at address 0 (3 cycles). In the second transaction, there is a hit (1 cycle). The third transaction in Figure 2 shows the beginning of a cache miss at address 1000, which generates a conflict with block 0. This block must then be saved (3 cycles) before the new block is loaded (3 cycles).

    Figure 3 depicts the internal structure of the cache. It consists of four unit types: the decoder, the Assoc banks, the replacement policy logic, and the cache controller logic. One SystemC method (sc_method) is associated with each of these units. Connections between these modules are implemented by signals (sc_signal) through ports. The bank unit stores both tags and data, and the comparator logic checks whether the requested block matches the selected block in the bank. After this comparison, a hit signal is sent to the cache controller logic, which sends the Ack signal to the CPU. The replacement policy logic holds block histories and, in the case of a conflict, determines which block to evict from the cache. Several policies are available: FIFO, LRU, and random.
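The hit, miss and eviction behaviour described above can be sketched in plain C++, without SystemC, as follows. This is not the authors' module: the class ToyCache and its members are hypothetical, only the LRU policy is modelled, addresses are treated as block numbers, and the cycle costs (1 for a hit, 3 for a block load, 3 for saving an evicted block) are simply taken from the Figure 2 walk-through. The sketch further assumes a write-back scheme in which only modified blocks are saved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One cache line: tag plus bookkeeping bits (hypothetical layout).
struct Line {
    bool valid = false;
    bool dirty = false;
    std::uint64_t tag = 0;
    std::uint64_t lastUse = 0;  // time of the most recent access, for LRU
};

class ToyCache {
public:
    static constexpr int kHitCycles = 1;        // from the Figure 2 example
    static constexpr int kFillCycles = 3;       // load a block on a miss
    static constexpr int kWriteBackCycles = 3;  // save an evicted dirty block

    std::uint64_t hits = 0, misses = 0, writeBacks = 0;

    ToyCache(std::size_t numSets, std::size_t assoc)
        : numSets_(numSets), assoc_(assoc), lines_(numSets * assoc) {
        assert(numSets > 0 && assoc > 0);
    }

    // Performs one access and returns the number of cycles it takes.
    // blockAddr is a block number, not a byte address (assumption).
    int access(std::uint64_t blockAddr, bool write) {
        const std::size_t set = static_cast<std::size_t>(blockAddr % numSets_);
        const std::uint64_t tag = blockAddr / numSets_;
        Line* ways = &lines_[set * assoc_];
        ++clock_;

        // Comparator logic: search the selected set for a matching tag.
        for (std::size_t w = 0; w < assoc_; ++w) {
            if (ways[w].valid && ways[w].tag == tag) {
                ways[w].lastUse = clock_;
                ways[w].dirty = ways[w].dirty || write;
                ++hits;
                return kHitCycles;
            }
        }

        // Replacement logic: take an empty way if any, else the LRU way.
        ++misses;
        Line* victim = &ways[0];
        for (std::size_t w = 0; w < assoc_; ++w) {
            if (!ways[w].valid) { victim = &ways[w]; break; }
            if (ways[w].lastUse < victim->lastUse) victim = &ways[w];
        }

        int cycles = kFillCycles;
        if (victim->valid && victim->dirty) {  // evicted block must be saved
            cycles += kWriteBackCycles;
            ++writeBacks;
        }
        victim->valid = true;
        victim->dirty = write;  // the missing block is loaded, then written
        victim->tag = tag;
        victim->lastUse = clock_;
        return cycles;
    }

private:
    std::size_t numSets_;
    std::size_t assoc_;
    std::vector<Line> lines_;
    std::uint64_t clock_ = 0;
};
```

With a direct-mapped instance of 8 sets, a read of block 0 costs 3 cycles, a following write to block 0 costs 1 cycle, and a read of block 1000 (which maps to the same set) costs 6 cycles. Making the second access a write is an assumption introduced here so that block 0 is modified before it is evicted.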

    [Figure 1 labels. Blocks: Processor, L1 Cache, Next Memory Level. Processor to L1 cache signals: address, $req, write, Ack, Data. L1 cache to next memory level signals: address, memReq, memWrite, Ack, Data.]


    Figure 2. Three data transfers in the cache

    Figure 3. Internal structure of the cache memory

    The cache uses the write-allocate policy to deal with write misses. The power consumption evaluation is performed by attaching the Cacti model to the cache controller. When the cache is declared in the SystemC description, the cache configuration parameters are used to evaluate the access time and the energy consumption for each access to the cache. These two values are stored by the cache module. Together with the activity statistics of the cache module (number of accesses with hits, misses, external bus accesses, etc.), they are used to evaluate the total execution time in cycles, as well as the total energy consumed by the cache, at the end of the simulation.
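A minimal sketch of this final accounting step is given below. The formula is an assumption inferred from the numbers reported in Figure 4, where the total energy equals the per-access energy multiplied by the sum of cache block reads and cache block writes; the paper does not state the formula explicitly, and the names CacheActivity and totalEnergyJoules are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Activity counters gathered during simulation (hypothetical structure).
struct CacheActivity {
    std::uint64_t blockReads;   // "#Cache Bloc Read" in the report
    std::uint64_t blockWrites;  // "#Cache Bloc Write" in the report
};

// Total energy in joules, assuming every block read or write costs the
// same per-access energy as estimated by the analytical model.
inline double totalEnergyJoules(const CacheActivity& a, double energyPerAccessJ) {
    assert(energyPerAccessJ >= 0.0);
    const std::uint64_t accesses = a.blockReads + a.blockWrites;
    return static_cast<double>(accesses) * energyPerAccessJ;
}
```

For the values of Figure 4 (121118 block reads, 38107 block writes, 3.37432e-09 J per access), this gives approximately 0.000537276 J, the total shown in the report.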

    4. Using the cache module in a SystemC SoC description

    Our cache modules can be used in two different ways. First, they can be used separately to analyze the cache performance of a given application. In this case, the cache is activated by the following command:

    sc-cacheAnal f -config

    where sc-cacheAnal is the name of the SystemC cache program, the first argument is the file containing the list of memory access addresses generated by memory tracing during functional simulation, and the second argument is the cache configuration file. The configuration file contains the following parameters: nlines, bsize, assoc, readPorts, writePorts and readWritePorts.


    start .....
    Cacti Statistics:
    Main Memory configuration: latency = 2
    Cache configuration:
    Size in bytes: 8192
    Number of sets: 128
    Associativity: 2
    Block Size (bytes): 32
    Read/Write Ports: 1
    Read Ports: 0
    Write Ports: 0
    Technology Size: 0.35um Vdd: 2.6V
    Access Time (ns): 2.19856
    Power (nJ): 3.37432
    Best Ndwl (L1): 1
    Best Ndbl (L1): 2
    Time Components:
    data side (with Output driver) (ns): 1.70219
    tag side (with Output driver) (ns): 2.19856
    decode_data (ns): 0.405051
                (nJ): 0.075142
    wordline and bitline data (ns): 0.601265
    etc.
    compare (ns): 0.557825
            (nJ): 0.0110586
    *******************
    SYSTEMC CACHE POWER AWARE SIMULATOR
    ****************
    Cache Configuration :
    LSU to Dcache Bus width in bytes : 4
    Dcache to Mem bus width in bytes : 8
    Statistics :
    Load / Store Instruction Nbr : 121118
    SystemC: simulation stopped by user.
    simulation time : 5.53403 seconds
    #cycles : 131330
    #Miss: 1733 #Hit : 119385
    #Cache Bloc Read : 121118
    #Cache Bloc Write: 38107
    Power per access : 3.37432e-09
    Total power in Cache (J) = 0.000537276

    Figure 4. Statistics report for an application example

    Figure 5. Execution time in millions of cycles as a function of block size in bytes (16 to 256), for assoc = 1, 2, 4 and 8

    Figure 6. Total energy consumption in mJ as a function of block size in bytes (16 to 256), for assoc = 1, 2, 4 and 8

    The first set of statistics is produced by Cacti and depends only on the cache configuration, not on the application. Cacti also reports the power and access time contribution of each cache component (decoder, wordline, bitline, etc.).

    The second set of statistics corresponds to application performance. It consists of the number of memory references, the number of cycles needed to execute these memory references, the number of hits and misses in the cache, and the total energy consumed by the cache.

    Figures 5 and 6 present, respectively, the total execution time (in millions of cycles) and the total energy consumption in millijoules (mJ) for executing the merge sort program on an array of 20 000 elements. This program generates 1 409 836 memory references. The optimal cache associativity or block size for a given application (Figures 5 and 6) depends on the relative weight given to the execution time and the power consumption. These experiments show that our cache description can be used in a design space exploration to determine the best cache configuration for a given application or set of applications.
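One possible way to express the trade-off between execution time and energy is a weighted cost over the simulated configurations, as in the C++ sketch below. The cost function (a weighted sum of time and energy, each normalised by the best value observed) is an assumption chosen for illustration and is not taken from the paper; the names Candidate and pickBest are hypothetical, and the measurements must come from actual simulation runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One simulated cache configuration and its measured results.
struct Candidate {
    unsigned blockBytes;  // block size in bytes
    unsigned assoc;       // associativity
    double cycles;        // execution time in cycles
    double energyJ;       // total energy in joules
};

// Returns the index of the candidate with the lowest weighted cost.
// timeWeight = 1 optimises execution time only, 0 optimises energy only.
inline std::size_t pickBest(const std::vector<Candidate>& c, double timeWeight) {
    assert(!c.empty());
    assert(timeWeight >= 0.0 && timeWeight <= 1.0);

    // Normalise by the best observed values so that both terms are unitless.
    double minCycles = c[0].cycles;
    double minEnergy = c[0].energyJ;
    for (const Candidate& x : c) {
        if (x.cycles < minCycles) minCycles = x.cycles;
        if (x.energyJ < minEnergy) minEnergy = x.energyJ;
    }
    assert(minCycles > 0.0 && minEnergy > 0.0);

    std::size_t best = 0;
    double bestCost = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        const double cost = timeWeight * (c[i].cycles / minCycles) +
                            (1.0 - timeWeight) * (c[i].energyJ / minEnergy);
        if (i == 0 || cost < bestCost) {
            best = i;
            bestCost = cost;
        }
    }
    return best;
}
```

A designer who cares mostly about battery life would call pickBest with a small timeWeight, whereas a performance-driven design would use a value close to 1.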

    5. Conclusion and perspectives

    After presenting the original aspects of our component library, we described the cache module structure in detail. This SystemC description allows accurate performance analysis as well as accurate evaluation of the cache's energy consumption. In the near future, multi-banked caches will be available in our library [8], and the library will be enhanced with several other components (processors, DRAM, bus, etc.).

    6. References:

    [1] T. Mudge. Power: A first class design constraint. IEEE Computer, April 2001.

    [2] G. Martin, H. Chang. Winning the SoC Revolution. Kluwer Academic Publi.

    [3] www.systemc.org

    [4] www.microlib.org

    [5] Orinoco, www.chipvision.com

    [6] S. Wilton and N. Jouppi. An Enhanced Access and Cycle Time Model for On-Chip Caches. Research Report WRL 1994.

    [7] P. Shivakumar and N. P. Jouppi. CACTI 3.0: An integrated cache timing, power, and area model. Research Report WRL01.

    [8] S. Niar, L. Eeckhout, K. De Bosschere. Comparing multiported cache schemes. Inter. Conf. on Parallel and Distributed Processing Techniques and Appli., 2003.