Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching by: Josefin Hallberg, Tuva Palm and Mats Brorsson Presented

Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching

by: Josefin Hallberg, Tuva Palm and Mats Brorsson

Presented by: Lena Salman

This paper combines four hardware schemes, normally not efficient on pointer-based data structures, and a greedy software prefetch with cache-conscious allocation to evaluate positive effects of increased locality, in a comparative evaluation, on five level 1 data cache line sizes. We show that cache-conscious allocation utilizes large cache lines efficiently and that none of the prefetch strategies evaluated add significantly to the effect already achieved by the cache-conscious allocation on the hardware evaluated. The passive prefetching mechanism of using large cache lines with cache-conscious allocation is by far outstanding.

Studies of cache-conscious allocation and software prefetch have sparked some interest in the area. Hardware prefetch research has shown some improvements, but not on randomly accessed data structures.

Introduction

Pointer-based data structures are usually randomly allocated in memory

Will usually not achieve good locality

Higher miss rates

Software approach against Hardware approach

Software approach

Two techniques:

Cache-conscious allocation: by far the most efficient

Software prefetch: better suited for automation and for implementation in compilers

Combining cache-conscious allocation with software prefetch does not add significantly to performance

Hardware approach

Calculating and prefetching pointers

Calculating pointer dependencies

Effects of effectively predicting what to evict from the cache

General HW prefetch: more likely to pollute the cache

Problem! All the hardware strategies take advantage of the increased locality of cache-consciously allocated data.

Prefetching and cache-conscious allocation

Should complement each other's weaknesses:

Cache-conscious allocation should reduce the prefetch overhead of fetching blocks with partially unwanted data.

Prefetching should reduce the cache misses and miss latencies between the nodes.

HW and SW techniques should complement each other's weaknesses: the allocation should reduce the prefetch overhead, and the prefetch should reduce the miss latency between individual nodes or data. We want the data in one block, and we want related blocks close to one another.

Cache-conscious allocation

Excellent improvement in execution time performance

Can be adapted to specific needs by choosing the cache-conscious block size (cc-block size)

Attempts to co-allocate data in the same cache line, so that nodes referenced one after another lie on the same cache line.

By far showed terrific results. Can be adapted manually to the block size, depending on the structure and on the cache: the cc-block size can be changed manually.

Allocation to improve locality

Cache-conscious allocation

Attempts to allocate the data in the same cache line

Better locality can be achieved

Improved cache performance by a reduction of misses

ccmalloc()

Does the cache-conscious allocation of memory.

Takes an extra argument: a pointer to a data structure that is likely to be referenced close in time to the newly allocated one.

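#ifdef CCMALLOC
/* Cache-conscious allocation: place the child near its parent. */
child = ccmalloc(sizeof(struct node), parent);
#else
/* Fallback: ordinary allocation with no placement hint. */
child = malloc(sizeof(struct node));
#endif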

ccmalloc()

Takes a pointer to data that is likely to be referenced close (in time) to the newly allocated structure

Invokes calls to the standard malloc():

When allocating a new cc-block

When data is larger than a cc-block

Otherwise: allocates in an empty slot of the cc-block

ccmalloc() exploits locality, for example between parents and children in a tree: it allocates the children close to their parents, depending of course on the cc-block size, which can be set by the programmer.
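The allocation rule described above can be sketched in C. This is only an illustrative sketch, not the ccmalloc() implementation evaluated in the paper: the names `ccmalloc_sketch`, `cc_header` and `block_of` are hypothetical, new cc-blocks come from `aligned_alloc()` rather than plain `malloc()` so that the start of a block can be found by masking an address, and nothing is ever freed. It also assumes the hint is either NULL or a pointer that was itself placed in a cc-block by this function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CC_BLOCK_SIZE 256u /* cc-block size used in the study */
#define CC_ALIGN _Alignof(max_align_t)

/* Hypothetical bookkeeping kept at the start of every cc-block. */
struct cc_header {
    size_t used; /* bytes handed out so far, header included */
};

static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) / align * align;
}

/* Start of the cc-block containing p; blocks are CC_BLOCK_SIZE-aligned. */
static struct cc_header *block_of(const void *p) {
    return (struct cc_header *)((uintptr_t)p & ~(uintptr_t)(CC_BLOCK_SIZE - 1));
}

/* neighbor must be NULL or a pointer that this function placed in a cc-block. */
void *ccmalloc_sketch(size_t size, const void *neighbor) {
    size_t header = round_up(sizeof(struct cc_header), CC_ALIGN);
    assert(size > 0);

    /* Data larger than a cc-block: fall back to the standard malloc(). */
    if (size > CC_BLOCK_SIZE - header)
        return malloc(size);
    size_t need = round_up(size, CC_ALIGN);

    /* Otherwise use an empty slot in the neighbor's cc-block, if there is one. */
    if (neighbor != NULL) {
        struct cc_header *b = block_of(neighbor);
        if (b->used + need <= CC_BLOCK_SIZE) {
            void *p = (char *)b + b->used;
            b->used += need;
            return p;
        }
    }

    /* No hint, or the block is full: allocate a new cc-block. */
    struct cc_header *b = aligned_alloc(CC_BLOCK_SIZE, CC_BLOCK_SIZE);
    if (b == NULL)
        return NULL;
    b->used = header + need;
    return (char *)b + header;
}
```

Calling `ccmalloc_sketch(sizeof(struct node), parent)` for each child places it in the parent's 256 B block while room remains, which is the parent/child locality the slide refers to.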

cc-blocks

= Cache-conscious blocks

Demands cache lines large enough to contain more than one pointer structure

The bigger the blocks, the lower the miss rate, if allocation is smart

Can be set dynamically in software, independently of the HW cache line size

In our study:

cc-block size: 256 B

hardware cache line size: 16 B to 256 B

Prefetch

Prefetching will reduce the cost of a cache miss

Can be controlled by software and/or hardware

Software results in extra instructions

Hardware leads to complexity in hardware

In our experiments we consider the level 1 data cache only.

Software controlled prefetch

Implemented by including a prefetch instruction in the instruction set

Should be inserted well ahead of the reference, according to a prefetch algorithm

In this study we use the greedy algorithm by Mowry et al.

Software prefetch – Greedy algorithm

When a node is referenced, it prefetches all children at that node

Without extra calculation, can only be done to children, not to grandchildren

Easier to control and optimize

The risk of polluting the cache decreases (since only needed lines are prefetched)

Software greedy prefetch
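A greedy prefetch on a binary tree might look like the sketch below, written in the spirit of the treeadd benchmark. It is an illustration only, not the code used in the study: `build_tree` and `tree_sum_greedy` are hypothetical names, and `__builtin_prefetch` is the GCC/Clang builtin standing in for the prefetch instruction. The builtin is only a hint and does not fault on a null address, so leaf nodes need no special case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
    long value;
    struct node *left;
    struct node *right;
};

/* Build a balanced binary tree with the given number of levels; every node holds 1. */
struct node *build_tree(int levels) {
    if (levels <= 0)
        return NULL;
    struct node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->value = 1;
    n->left = build_tree(levels - 1);
    n->right = build_tree(levels - 1);
    return n;
}

/* Recursive sum; on reaching a node, greedily prefetch all of its children. */
long tree_sum_greedy(const struct node *n) {
    if (n == NULL)
        return 0;
    /* Only the children's addresses are known here; reaching the
       grandchildren would require loading the children first. */
    __builtin_prefetch(n->left);
    __builtin_prefetch(n->right);
    return n->value + tree_sum_greedy(n->left) + tree_sum_greedy(n->right);
}
```

The prefetch of the right child has the whole traversal of the left subtree to complete, while the prefetch of the left child is issued immediately before its use. That is one reason little calculation between references makes prefetching hard to schedule.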

Hardware Controlled Prefetch

Depending on the algorithm used, prefetching can occur when a miss is caused

Or when a hint is given by the programmer through an instruction,

Or can always occur on certain types of data

Hardware prefetch

Techniques used:

Prefetch on miss

Tagged Prefetch

Attempt to utilize spatial locality

Do NOT analyze data access patterns

Prefetch-on-Miss

Prefetches the next sequential line i+1 when detecting a miss on line i.

Line i-1

Line i : Miss!

Line i+1 : will be prefetched

Each miss in the cache leads to the fetch of two lines into the cache, if the line to prefetch is not already in the cache. Drawback: this can leave a lot of unused data in the cache, since the next cache line is prefetched on every miss and those lines are not necessarily useful. The performance of prefetch-on-miss is decided by the regularity of data references and their locality.

Tagged Prefetch

Each prefetched line is marked with a tag

When a prefetched line i is referenced, line i+1 is prefetched (no miss has occurred)

Efficient when memory is referenced fairly sequentially; shown to be efficient in earlier studies

This is an efficient prefetch method to use when memory is referenced fairly sequentially, and has been shown in studies without pointer-based data structures, [20, 21], to provide up to twice the performance improvements of prefetch-on-miss. However, as with prefetch-on-miss, prefetches are done indiscriminately on every miss and on referencing a prefetched line in the level 1 cache, risking unused data in the cache.
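The difference between prefetch-on-miss and tagged prefetch can be seen in a toy model. The sketch below is a simplification made for illustration, not the simulator used in the study: the cache never evicts, a prefetched line arrives instantly, and `NUM_LINES`, `struct sim` and `count_misses` are hypothetical names. It only counts misses for a given sequence of line numbers under each policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_LINES 64 /* hypothetical size of the simulated memory, in cache lines */

enum policy { PREFETCH_ON_MISS, TAGGED_PREFETCH };

struct sim {
    bool present[NUM_LINES]; /* line is in the cache */
    bool tagged[NUM_LINES];  /* line was prefetched and not yet referenced */
    int misses;
};

/* Bring a line in ahead of use, unless it is out of range or already cached. */
static void prefetch(struct sim *s, int line, bool tag) {
    if (line < NUM_LINES && !s->present[line]) {
        s->present[line] = true;
        s->tagged[line] = tag;
    }
}

static void access_line(struct sim *s, enum policy p, int line) {
    assert(line >= 0 && line < NUM_LINES);
    if (!s->present[line]) {
        /* Both policies fetch line i and prefetch line i+1 on a miss. */
        s->misses++;
        s->present[line] = true;
        prefetch(s, line + 1, p == TAGGED_PREFETCH);
    } else if (p == TAGGED_PREFETCH && s->tagged[line]) {
        /* First reference to a prefetched line: clear the tag, prefetch i+1. */
        s->tagged[line] = false;
        prefetch(s, line + 1, true);
    }
}

/* Number of misses for the given sequence of line numbers under policy p. */
int count_misses(enum policy p, const int *lines, int n) {
    struct sim s;
    memset(&s, 0, sizeof s);
    for (int i = 0; i < n; i++)
        access_line(&s, p, lines[i]);
    return s.misses;
}
```

In this model a sequential scan of 16 lines misses on every other line with prefetch-on-miss but only on the first line with tagged prefetch, while a scattered sequence such as lines 0, 10, 20, 30 misses every time under both policies and each miss pulls in a neighbor that is never used.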

Prefetch-on-miss for ccmalloc()

HW prefetch can be combined with ccmalloc() by introducing a hint with the address of the beginning of the cc-block.

Prefetch-one-cc on miss

Prefetches the next line after detecting a cache miss on a cache-consciously allocated block.

Prefetch-all-cc on miss

Decides dynamically how many lines to prefetch.

Depends on where in the cc-block the missing cache line is located.

Prefetches all the cache lines of the cc-block, from the address causing the miss
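The dynamic decision made by prefetch-all-cc can be illustrated with a small calculation. The sketch below rests on stated assumptions: cc-blocks are aligned to their own size, and "from the address causing the miss" is read as every line after the missing one up to the end of the cc-block. The function name `lines_to_prefetch_all_cc` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of lines prefetch-all-cc would request after a miss at miss_addr:
   every line that follows the missing line inside the same cc-block. */
unsigned lines_to_prefetch_all_cc(uintptr_t miss_addr,
                                  unsigned cc_block_size,
                                  unsigned line_size) {
    assert(line_size > 0 && cc_block_size % line_size == 0);
    /* Assumes cc-blocks are aligned to cc_block_size. */
    uintptr_t offset = miss_addr % cc_block_size;
    unsigned line_index = (unsigned)(offset / line_size);
    unsigned lines_per_block = cc_block_size / line_size;
    return lines_per_block - 1 - line_index;
}
```

With the 256 B cc-blocks of the study and 64 B cache lines, a miss on the first line of a block requests the three remaining lines and a miss on the last line requests none. With 256 B cache lines the whole cc-block arrives with the missing line and nothing is left to prefetch.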

Experimental Framework

MIPS-like, out-of-order processor simulator.

Memory latency equal to 50 ns random access time.

Benchmarks:

health: simulates the Columbian health care system

mst: creates a graph and calculates its minimum spanning tree

perimeter: calculates the perimeter of an image

treeadd: calculates a recursive sum of values

More about benchmarks:

health: elements are moved between lists during execution, and there is more calculation between data references.

mst: originally used a locality optimization procedure, which made the effect of ccmalloc() unnoticeable.

perimeter: data is allocated in an order similar to the access order, resulting in some locality optimization.

treeadd – has calculation between nodes in a balanced binary tree.

health uses doubly-linked trees to model a hierarchy of hospitals and a set of doubly-linked lists to track the patients undergoing treatment at each hospital. The simulation is time-based; on every iteration each hospital is processed. Patients periodically show up at hospitals and undergo a multi-step treatment process: 1) they wait for a free physician to attend them, 2) the physician assesses their ailment and either transfers them to the next hospital up the hierarchy (back to step (1) at a new hospital), or 3) the physician treats them. There are fixed latencies for assessment and treatment, but the time spent waiting depends on congestion at the hospital. There is a fixed probability that a patient shows up each cycle and that the patient needs to be referred to the next level of the hospital. The whole benchmark is less than 500 lines of C code.

Results: Execution time

The concept of stall is not well defined. We use the same definition as has been done in many other previous studies: when the maximum number of instructions are retired in a clock cycle, that cycle is counted as busy. Otherwise, we say that the cycle is stalled due to the oldest instruction waiting to be retired. If that is a load or store instruction, it is a memory stall, otherwise it is a FU stall. If there is no instruction waiting to be retired, the stall is said to be a fetch stall. This means that busy time is the fraction of all clock cycles when the full issue width can be utilized. The optimization strategies are likely to have the greatest effect on the benchmarks where memory stalls are predominant.

At the end of this section is an overview of benchmark parameters and behavior, see Table 4, chosen according to the studies that we are re-examining and combining.

health simulates a Columbian health care system. Elements are moved between lists during execution, and there is more calculation between data references compared to the other benchmarks. Because of poor data structures and algorithms, health is not an exemplary benchmark, as pointed out by Zilles, [24]. As results from health are presented here, the reader is alerted to read those results with caution. They are still relevant for our memory allocation evaluations.

mst creates a graph and calculates its minimal spanning tree. The mst benchmark originally used a locality optimization procedure which made the effects of ccmalloc() non-existent. The data structures were allocated in 32 kB blocks, not fitting in the 256 B cc-blocks used in ccmalloc(). mst was thus modified to use an ordinary allocation procedure instead, to enable measuring the effects of ccmalloc().

perimeter calculates the perimeter of a region of an image. The data structures are allocated in an order similar to access order, resulting in some kind of locality optimization. There are few calculations between references, complicating prefetch.

treeadd calculates a recursive sum of values in a balanced binary-tree. It is similar to perimeter, but has slightly more calculations between data references.

Stalls:

Memory stall: the cycle is stalled on the oldest instruction waiting to be retired, and that instruction is a load/store.

FU stall: the oldest instruction waiting to be retired is not a load/store.

Fetch stall: there is no instruction waiting to be retired.

Prefetch is likely to have the greatest effect when memory stalls are dominant!!
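The stall definitions above amount to a per-cycle classification, which could be sketched as follows. The enum and function names are hypothetical, and the sketch only restates the definition given in the talk (a cycle is busy when the maximum number of instructions retire); it is not taken from the simulator.

```c
#include <assert.h>

enum cycle_kind {
    CYCLE_BUSY,
    CYCLE_MEMORY_STALL,
    CYCLE_FU_STALL,
    CYCLE_FETCH_STALL
};

/* Kind of the oldest instruction waiting to be retired, if there is one. */
enum oldest_kind { OLDEST_NONE, OLDEST_LOAD_STORE, OLDEST_OTHER };

enum cycle_kind classify_cycle(int retired, int max_retire,
                               enum oldest_kind oldest) {
    assert(retired >= 0 && retired <= max_retire);
    if (retired == max_retire)
        return CYCLE_BUSY;        /* full retire width was used */
    if (oldest == OLDEST_NONE)
        return CYCLE_FETCH_STALL; /* nothing is waiting to be retired */
    return oldest == OLDEST_LOAD_STORE ? CYCLE_MEMORY_STALL : CYCLE_FU_STALL;
}
```

Summing the cycles of each kind over a run gives the busy, memory, FU and fetch fractions of execution time.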

Graphs:

Cache performance - SW

Miss rates are improved by most strategies

Increased spatial locality with ccmalloc() reduces cache misses (less pollution)

Software prefetch shows some decrease in misses, but prefetches a lot of unused data

Combination of software techniques achieves the lowest rates

Cache performance – cache lines

The larger the cache lines, the more effective ccmalloc() is

HW prefetch alone, however, tends to pollute the cache with unwanted data

SW prefetch alone tends to fetch data that is already in the cache

Cache performance:

SW prefetch achieves higher precision

HW prefetch strategies alone perform poorly.

HW prefetch is more sensitive to cache line size than the SW prefetch

Cache performance: SW prefetch with ccmalloc()

Results in an increased number of used cache lines among the prefetched lines

This is caused by increased spatial locality

However! It also results in attempts to prefetch lines that are already in the cache.

Cache performance – HW prefetch with ccmalloc()

HW prefetch strategies give greater improvement with cache-conscious allocation than on their own

Prefetch-on-miss and tagged prefetch both show the same results

Still: a large amount of unused prefetched lines

Unused lines decrease with larger cache lines, due to spatial locality, and lack of need to prefetch

Conclusions:

The best way still remains cache-conscious allocation: ccmalloc()

Efficient at overcoming the drawbacks of large cache lines

Creates the locality necessary for prefetch

The larger the cache line, the less prominent the prefetch strategy

Cache-conscious allocation outperforms both hardware and software prefetch on their own, while software prefetch outperforms hardware prefetch without cache-conscious allocation of data.

Conclusions 2:

Adding HW prefetch to cache-conscious allocation gives no prominent gain; ccmalloc() alone seems to be enough

However, ccmalloc() can be used to overcome the negative effect of next-line prefetch

HW prefetch is better than SW prefetch

Conclusions 3:

When a compiler can use profiling information and optimize memory allocation in a cache-conscious manner, that is preferable!

However, when profiling is too expensive, applications will likely benefit from general prefetch support.

The endddd!!!

You can tell me, I can take it..

What’s up doc???

Lena Salman

28.06.2004
