Cache Conscious Allocation of Pointer Based Data
Structures, Revisited with HW/SW Prefetching
by: Josefin Hallberg, Tuva Palm and Mats Brorsson
Presented by: Lena Salman
Introduction
Pointer-based data structures are usually allocated at random locations in memory
They will usually not achieve good locality
This leads to higher miss rates
Software approach versus hardware approach
Software approach
Two techniques:
Cache-conscious allocation - by far the most efficient
Software prefetch - better suited for automation, and easier to implement in compilers
Combining cache-conscious allocation and software prefetch does not add significantly to performance
Hardware approach
Calculating and prefetching pointers
Calculating pointer dependencies
Effects of effectively predicting what to evict from the cache
General HW prefetch - more likely to pollute the cache
Problem! All the hardware strategies take advantage of the increased locality of cache-consciously allocated data.
Prefetching and cache-conscious allocation
Should complement each other's weaknesses:
Cache-conscious allocation should reduce the prefetch overhead of fetching blocks with partially unwanted data.
Prefetching should reduce the cache misses and miss latencies between the nodes.
Cache-conscious allocation
Excellent improvement in execution time performance
Can be adapted to specific needs by choosing the cache-conscious block size (cc-block size)
Attempts to co-allocate data in the same cache line, so that nodes referenced one after another sit on the same cache line.
Allocation to improve locality
Cache-conscious allocation
Attempts to allocate the data in the same cache line
Better locality can be achieved
Improved cache performance by a reduction of misses
ccmalloc()
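Performs the cache-conscious allocation of memory.
Takes an extra argument: a pointer to a data structure that is likely to be referenced together with the new one.
/* Allocate the child close to its parent when ccmalloc() is available. */
#ifdef CCMALLOC
child = ccmalloc(sizeof(struct node), parent);
#else
child = malloc(sizeof(struct node));
#endif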
ccmalloc()
Takes a pointer to data that is likely to be referenced close (in time) to the newly allocated structure
Invokes the standard malloc():
When allocating a new cc-block
When the data is larger than a cc-block
Otherwise: allocates in an empty slot of the cc-block
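The allocation rule above can be sketched in C. This is a minimal illustration under stated assumptions, not the ccmalloc() implementation from the study: the names cc_malloc_sketch, cc_find, cc_round_up, struct cc_block and the CC_MAX_BLOCKS limit are hypothetical, blocks are tracked in a simple fixed-size table, nothing is ever freed, and aligned_alloc() is used in place of plain malloc() for new cc-blocks so that each block starts on a 256-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CC_BLOCK_SIZE 256u   /* cc-block size used in the study */
#define CC_MAX_BLOCKS 1024u  /* hypothetical limit for this sketch */

/* Hypothetical bookkeeping: base address and bytes used per cc-block. */
struct cc_block { unsigned char *base; size_t used; };
static struct cc_block cc_blocks[CC_MAX_BLOCKS];
static size_t cc_count;

/* Round n up so every slot stays suitably aligned for any object type. */
static size_t cc_round_up(size_t n) {
    size_t a = _Alignof(max_align_t);
    return (n + a - 1) / a * a;
}

/* Find the cc-block that contains hint, or NULL if there is none. */
static struct cc_block *cc_find(const void *hint) {
    uintptr_t h = (uintptr_t)hint;
    for (size_t i = 0; i < cc_count; i++) {
        uintptr_t b = (uintptr_t)cc_blocks[i].base;
        if (h >= b && h < b + CC_BLOCK_SIZE) return &cc_blocks[i];
    }
    return NULL;
}

void *cc_malloc_sketch(size_t size, const void *hint) {
    assert(size > 0);
    size_t need = cc_round_up(size);

    /* Data larger than a cc-block goes straight to the standard allocator. */
    if (need > CC_BLOCK_SIZE) return malloc(size);

    /* Otherwise try an empty slot in the cc-block that holds the hint. */
    struct cc_block *blk = hint ? cc_find(hint) : NULL;
    if (blk && blk->used + need <= CC_BLOCK_SIZE) {
        void *p = blk->base + blk->used;
        blk->used += need;
        return p;
    }

    /* No room near the hint: start a new cc-block (or fall back if the table is full). */
    if (cc_count == CC_MAX_BLOCKS) return malloc(size);
    unsigned char *base = aligned_alloc(CC_BLOCK_SIZE, CC_BLOCK_SIZE);
    if (!base) return NULL;
    cc_blocks[cc_count].base = base;
    cc_blocks[cc_count].used = need;
    cc_count++;
    return base;
}
```

With this sketch, a call such as cc_malloc_sketch(sizeof(struct node), parent) places the child in the same 256-byte block as its parent whenever that block still has room.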
cc-blocks
= Cache-conscious blocks
Demands cache lines large enough to contain more than one pointer structure
The bigger the blocks, the lower the miss rate, if allocation is smart.
Can be set dynamically in software, independently of the HW cache line size.
In our study: cc-block size 256B, hardware cache line size 16B to 256B
Prefetch
Prefetching reduces the cost of a cache miss
Can be controlled by software and/or hardware
Software control results in extra instructions
Hardware control leads to more complex hardware
Software controlled prefetch
Implemented by including a prefetch instruction in the instruction set
Prefetches should be inserted well ahead of the reference, according to a prefetch algorithm
In this study: we use the greedy algorithm by Mowry et al.
Software prefetch – Greedy algorithm
When a node is referenced, it prefetches all children of that node.
Without extra calculation, this can only be done for children, not for grandchildren
Easier to control and optimize
The risk of polluting the cache decreases (since only needed lines are prefetched)
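A minimal sketch of greedy prefetching on a binary tree, in the style of the treeadd benchmark. The names struct tree_node and treeadd_greedy are hypothetical, and __builtin_prefetch is the GCC/Clang builtin, used here as a stand-in for whatever prefetch instruction the simulated instruction set provides.

```c
#include <stddef.h>

/* Hypothetical node layout for illustration. */
struct tree_node {
    struct tree_node *left;
    struct tree_node *right;
    long value;
};

/* Recursive sum over the tree, with greedy prefetching of children. */
long treeadd_greedy(const struct tree_node *n) {
    if (n == NULL) return 0;
    /* Greedy step: on visiting a node, prefetch all of its children.
       Grandchildren are not reachable without first loading the children. */
    if (n->left)  __builtin_prefetch(n->left);
    if (n->right) __builtin_prefetch(n->right);
    return n->value + treeadd_greedy(n->left) + treeadd_greedy(n->right);
}
```

The prefetch of the right child is the one most likely to pay off here, since the whole left subtree is processed before the right child is referenced.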
Software greedy prefetch
Hardware Controlled Prefetch
Depending on the algorithm used, prefetching can occur when a miss occurs,
or when a hint is given by the programmer through an instruction,
or it can always occur on certain types of data
Hardware prefetch
Techniques used:
Prefetch on miss
Tagged Prefetch
Attempt to utilize spatial locality
Do NOT analyze data access patterns
Prefetch-on-Miss
Prefetches the next sequential line i+1 when detecting a miss on line i.
[Figure: lines i-1, i and i+1. Line i misses; line i+1 will be prefetched.]
Tagged Prefetch
Each prefetched line is marked with a tag
When a prefetched line i is referenced, line i+1 is prefetched (no miss has occurred)
Has been shown to be efficient when memory accesses are fairly sequential
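The difference between prefetch-on-miss and tagged prefetch can be sketched with a toy simulation. It assumes an unbounded cache with no evictions, so only the trigger rule differs between the two policies; count_misses, enum policy and SIM_LINES are hypothetical names for this illustration and do not come from the study's simulator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_LINES 64u  /* hypothetical size of the simulated address space, in lines */

enum policy { PREFETCH_ON_MISS, TAGGED_PREFETCH };

/* Count demand misses for a trace of line numbers. */
size_t count_misses(enum policy p, const unsigned *trace, size_t n) {
    bool present[SIM_LINES + 1] = {false};
    bool tagged[SIM_LINES + 1] = {false};  /* line was prefetched, not yet used */
    size_t misses = 0;

    for (size_t k = 0; k < n; k++) {
        unsigned i = trace[k];
        assert(i < SIM_LINES);
        bool trigger = false;
        if (!present[i]) {           /* demand miss on line i */
            misses++;
            present[i] = true;
            trigger = true;          /* both policies prefetch i+1 on a miss */
        } else if (p == TAGGED_PREFETCH && tagged[i]) {
            tagged[i] = false;       /* first use of a prefetched line */
            trigger = true;          /* tagged prefetch also fires here */
        }
        if (trigger && !present[i + 1]) {
            present[i + 1] = true;   /* bring in the next sequential line */
            tagged[i + 1] = true;
        }
    }
    return misses;
}
```

On a purely sequential trace of lines 0 to 7, this sketch counts 4 misses for prefetch-on-miss (every second line) and 1 miss for tagged prefetch, because each first use of a prefetched line triggers the next prefetch.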
Prefetch-on-miss for ccmalloc()
HW prefetch can be combined with ccmalloc() by introducing a hint with the address of the beginning of such a block.
Prefetch-one-cc on miss
Prefetches the next line after detecting a cache miss on a cache-consciously allocated block.
[Figure: a miss inside a cc-block]
Prefetch-all-cc on miss
Decides dynamically how many lines to prefetch.
This depends on where in the cc-block the missing cache line is located.
Prefetches all the cache lines of the cc-block, starting from the address causing the miss
[Figure: a miss inside a cc-block]
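How many lines prefetch-all-cc fetches follows from simple address arithmetic. The sketch below assumes cc-blocks are aligned to their own size, that the cc-block size is a multiple of the line size, and that the missing line itself is fetched on demand, so only the lines after it are prefetched; that last point is an interpretation of the slide. The function name prefetch_all_cc and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill out[] with the addresses of the cache lines to prefetch after a miss
   at miss_addr: every line after the missing one, up to the end of its
   cc-block. Returns the number of addresses written (at most out_cap). */
size_t prefetch_all_cc(uintptr_t miss_addr, size_t line_size, size_t cc_size,
                       uintptr_t *out, size_t out_cap) {
    assert(line_size > 0 && cc_size >= line_size && cc_size % line_size == 0);
    uintptr_t line = miss_addr - miss_addr % line_size;              /* missing line */
    uintptr_t block_end = miss_addr - miss_addr % cc_size + cc_size; /* end of cc-block */
    size_t n = 0;
    for (uintptr_t a = line + line_size; a < block_end && n < out_cap; a += line_size)
        out[n++] = a;
    return n;
}
```

For a 256B cc-block with 64B lines, a miss in the second line of the block yields two prefetches, while a miss in the last line yields none. When the cache line is as large as the cc-block, there is nothing left to prefetch.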
Experimental Framework
MIPS-like, out-of-order processor simulator.
Memory latency equal to 50 ns random access time.
Benchmarks:
health - simulates the Colombian health care system
mst - creates a graph and calculates its minimum spanning tree
perimeter - calculates the perimeter of an image
treeadd - calculates a recursive sum of values
More about benchmarks:
health - elements are moved between lists during execution, and there is more calculation between data accesses.
mst - originally used a locality optimization procedure, which made the effect of ccmalloc() unnoticeable.
perimeter - data is allocated in an order similar to the access order, which already gives good locality.
treeadd - has calculation between nodes in a balanced binary tree.
Results: Execution time
Stalls:
Memory stall - an instruction waits a cycle because the oldest instruction waiting to be retired is a load/store instruction.
FU stall - the oldest instruction is not a load/store instruction.
Fetch stall - there is no instruction waiting to be retired.
Prefetching is most likely to help when memory stalls are dominant!
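The three stall categories amount to a small decision rule, sketched below. The names classify_stall and enum stall_kind and the two boolean inputs are hypothetical; how the simulator in the study exposes this state is not described on the slide.

```c
#include <stdbool.h>

enum stall_kind { MEMORY_STALL, FU_STALL, FETCH_STALL };

/* Classify a cycle in which no instruction retires. */
enum stall_kind classify_stall(bool retire_queue_empty, bool oldest_is_load_store) {
    /* Nothing is waiting to be retired: the front end did not deliver. */
    if (retire_queue_empty) return FETCH_STALL;
    /* Otherwise the type of the oldest waiting instruction decides. */
    return oldest_is_load_store ? MEMORY_STALL : FU_STALL;
}
```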
Graphs:
Cache performance - SW
Miss rates are improved by most strategies
Increased spatial locality with ccmalloc() reduces cache misses (less pollution)
Software prefetch shows some decrease in misses, but prefetches a lot of unused data
Combination of software techniques achieves the lowest rates
Cache performance – cache lines
The larger the cache lines, the more effective ccmalloc() is
HW prefetch alone, however, tends to pollute the cache with unwanted data
SW prefetch alone tends to fetch data that is already in the cache
Cache performance:
SW prefetch achieves higher precision
HW prefetch strategies alone are not effective.
HW prefetch is more sensitive to cache line size than the SW prefetch
Cache performance - SW prefetch with ccmalloc()
Results in a larger share of used cache lines among the prefetched lines
This is caused by increased spatial locality
However! It also results in attempts to prefetch lines that are already in the cache.
Cache performance – HW prefetch with ccmalloc()
HW prefetch strategies show greater improvement with cache-conscious allocation than on their own
Prefetch-on-miss and tagged prefetch both show the same results
Still: a large amount of unused prefetched lines
Unused lines decrease with larger cache lines, due to spatial locality and less need to prefetch
Conclusions:
The best way still remains cache conscious allocation – ccmalloc()
Efficient at overcoming the drawbacks of large cache lines
Creates the locality necessary for prefetch
The larger the cache line, the less prominent the prefetch strategy
Conclusions 2:
Adding HW prefetch to cache-conscious allocation gives no prominent gain; ccmalloc() alone seems to be enough
However, ccmalloc() can be used to overcome the negative effect of next-line prefetch
HW prefetch is better than SW prefetch
Conclusions 3:
When a compiler can use profiling information and optimize memory allocation in a cache-conscious manner, that is preferable!
However, when profiling is too expensive, applications will likely benefit from general prefetch support.
The endddd!!!
You can tell me, I can take it..
What’s up doc???
Lena Salman
28.06.2004