Caching Considerations for Generational Garbage Collection Presented By: Felix Gartsman – 306054172 http://www.cs.tau.ac.il/~gartsma/seminar.ppt [email protected]


Page 1: Caching Considerations for Generational Garbage Collection

Caching Considerations for Generational Garbage Collection

Presented By: Felix Gartsman – 306054172

http://www.cs.tau.ac.il/~gartsma/seminar.ppt

[email protected]

Page 2: Caching Considerations for Generational Garbage Collection

Introduction

Main theme: Effect of memory caches on GC performance

What is a memory cache?

How do caches work?

How and why do caches and GC interact?

Can we boost GC performance by knowing more about caches?

Page 3: Caching Considerations for Generational Garbage Collection

Motivation

CPU and memory performance do not advance at the same speed

When the CPU waits for memory, it is idle

Solutions: pipelining, speculative execution and caches

Caches provide fast access to commonly accessed memory

Page 4: Caching Considerations for Generational Garbage Collection

Caches and GC

Two-way relationship:

Improving GC performance through “cache awareness” – minimizing cache misses by the collector

GC improving the mutator’s memory access locality, minimizing cache misses by the mutator (not dealt with by the article)

Page 5: Caching Considerations for Generational Garbage Collection

Previous Work (Outdated!)

Deals mainly with the interaction with virtual memory systems

No special attention to Generational GC

Assumed “best/worst cases”, special hardware

Investigated only direct-mapped caches

Page 6: Caching Considerations for Generational Garbage Collection

Article Contribution

Surveys GGC performance on various caches

Checks techniques for improving performance

Main advice: try to keep the youngest generation fully in the cache. If that is impossible, prefer associative caches

Page 7: Caching Considerations for Generational Garbage Collection

Roadmap

Cache in-depth

GC memory reuse cycle

GGC as better GC

Comparing cache size requirements

Comparing misses for different cache types

Conclusions

Page 8: Caching Considerations for Generational Garbage Collection

Cache in-depth

Memory Hierarchy: Registers → Cache → Main Memory → Virtual Memory (Disk)

Cache Hierarchy: L1 → L2 → L3(?)

Higher level means higher speed and smaller capacity

A miss in one level relays the handling to a lower level

Page 9: Caching Considerations for Generational Garbage Collection

Motivation contd.

When a memory word is not in the cache, a “cache miss” occurs

A cache miss stalls the CPU, and forces an access to main memory

Cache misses are expensive

Cache misses become more expensive with each new generation of CPUs

Memory access penalty on the P4: L1 – 2 cycles, L2 – 7, miss – dozens, depending on memory type

Page 10: Caching Considerations for Generational Garbage Collection

Cache properties

Size (8-64 KB in L1, 128 KB-3 MB in L2, 6-8 MB in L3?)

Layout (block size and sub-blocks)

Placement (N:M hash function)

Associativity

Write strategy: write-through or write-back; fetch-on-write or write-around

Page 11: Caching Considerations for Generational Garbage Collection

Cache Size

Size – the bigger the better. Too small a cache can render a fast CPU sluggish (the Intel Celeron, for example)

A bigger cache reduces cache misses

Constraints:

Physical feasibility (proximity, size, heat)

Money (cost vs. performance ratio)

Page 12: Caching Considerations for Generational Garbage Collection

Cache Layout

Cache memory is divided into blocks called “cache lines”

Each line contains a validity bit, a dirty bit, replacement-policy bits, an address tag and, of course, the data

Bigger blocks reduce misses when spatial locality is good, but hurt performance when working on multiple memory regions. Bigger lines also take longer to fill
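The per-line fields listed above can be sketched as a simple structure (a minimal illustrative model; the field names are assumptions, though the 16-byte line size matches the experiments later in the talk):

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    """One cache line: metadata plus the cached data block."""
    valid: bool = False      # does this line hold a real block?
    dirty: bool = False      # modified since it was fetched (write-back caches)?
    lru_bits: int = 0        # replacement-policy state (e.g. LRU age)
    tag: int = 0             # high-order address bits identifying the block
    data: bytes = bytes(16)  # the block itself (16-byte lines here)
```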

Page 13: Caching Considerations for Generational Garbage Collection

Cache Layout contd.

This can be solved by dividing lines into sub-blocks and managing them separately

Page 14: Caching Considerations for Generational Garbage Collection

Cache Placement

Maps a memory address to a block number

Examples:

Address modulo #blocks

Select the middle bits of the address

Select some set of bits

Must be fast and “hardware friendly”

Should map addresses uniformly
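As a toy illustration of the “middle bits” placement function (the cache geometry here is an arbitrary assumption):

```python
def block_index(address: int, num_blocks: int = 1024, line_size: int = 16) -> int:
    """Map a byte address to a cache block number.

    Dropping the low-order offset bits (the position within the line) and
    taking the result modulo the number of blocks selects the "middle bits"
    of the address -- fast, hardware friendly, and uniform for typical
    address streams.
    """
    return (address // line_size) % num_blocks

assert block_index(0x0000) == 0       # first line of the cache
assert block_index(0x0010) == 1       # next line: adjacent block
assert block_index(1024 * 16) == 0    # num_blocks * line_size apart: collides with 0
```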

Page 15: Caching Considerations for Generational Garbage Collection

Cache Associativity

Fully associative – each address can be in any block. All tags must be checked – slow or expensive. LRU replacement

Direct mapped – each address can be in only one block. Fast lookup, but no usage history

Set associative – each address can be in a set (2, 4, 8) of blocks. A compromise – fast access and limited usage history
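A toy set-associative lookup with LRU replacement might look like this (an illustrative sketch, not the article’s simulator; with `ways=1` it degenerates to a direct-mapped cache):

```python
class SetAssociativeCache:
    """Toy N-way set-associative cache that tracks hits and misses only."""

    def __init__(self, num_sets: int = 512, ways: int = 2, line_size: int = 16):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        # Each set is a list of tags, ordered least- to most-recently used.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, evict the set's LRU tag if full."""
        block = address // self.line_size
        s = self.sets[block % self.num_sets]   # placement: middle bits
        tag = block // self.num_sets
        hit = tag in s
        if hit:
            s.remove(tag)                      # refresh LRU position
        elif len(s) == self.ways:
            s.pop(0)                           # evict least recently used
        s.append(tag)                          # now most recently used
        return hit
```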

Page 16: Caching Considerations for Generational Garbage Collection

Cache Write Strategy

Write-through – write directly to memory and, of course, update the cache (slow, but can use write buffers)

Write-back – write to the cache and mark the line dirty; flush to memory later. Very useful for multiple writes to nearby addresses (object initialization). Can also use write buffers (less useful here)

Page 17: Caching Considerations for Generational Garbage Collection

Cache Write Strategy contd.

What to do on a write cache miss?

Fetch-on-write/write-allocate – on a miss, fetch the corresponding cache line and treat it as a write hit

Write-around/write-no-allocate – write directly to memory

Usual pairings: write-back with write-allocate, write-through with write-no-allocate
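The advantage of write-back for clustered writes (such as object initialization) can be illustrated by counting words sent to main memory (a toy traffic model, not a full simulator; the one-word flush is a simplification):

```python
def words_to_memory(writes_to_one_line: int, policy: str) -> int:
    """Words written to main memory for N stores hitting a single cache line.

    Write-through: every store goes straight to memory.
    Write-back: stores only dirty the cached line; memory sees one flush
    (modelled here as one word) when the line is eventually evicted.
    """
    if policy == "write-through":
        return writes_to_one_line
    if policy == "write-back":
        return 1 if writes_to_one_line else 0
    raise ValueError(f"unknown policy: {policy}")

# Initializing a 4-word object that fits in one line:
assert words_to_memory(4, "write-through") == 4
assert words_to_memory(4, "write-back") == 1
```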

Page 18: Caching Considerations for Generational Garbage Collection

Modern memory usage

Object-oriented languages tend to create many small objects for short periods. For example, STL uses value semantics, which copies objects on every operation!

Functional languages (Lisp, Scheme) constantly create new objects which replace old ones (cons and friends…)

Page 19: Caching Considerations for Generational Garbage Collection

Modern memory usage contd.

Creation is expensive – allocation with a probable write miss (a new address is used). The article cites sources claiming that functional languages write memory in up to 25% of their instructions (about 10% for others)

Page 20: Caching Considerations for Generational Garbage Collection

Memory Recycling Pattern

GC systems tend to violate locality assumptions

Cyclic reuse of memory beats any caching policy. The reuse cycle is too long to be captured

GC systems become bandwidth limited

Page 21: Caching Considerations for Generational Garbage Collection

Allocation is to blame, not GC

The locality of the GC process itself is not “the weakest link”

The problem is the fast allocation of memory which will be reclaimed much later

Main memory fills very fast. What to do?

1. GC – too frequent, but avoids paging
2. Use VM – touches many pages and causes paging

Page 22: Caching Considerations for Generational Garbage Collection

Pattern Results

Allocation touches new memory, and forces a page-in/page fetch (slow)

Why fetch? The memory allocated was used previously. The OS doesn’t know it’s garbage, and allocation will overwrite it anyway

Informing the OS that no fetch is required speeds up execution

Page 23: Caching Considerations for Generational Garbage Collection

Pattern Results contd.

When main memory is exhausted (or the process isn’t allowed more pages), old pages must be evicted

Those pages are probably dirty – they must be written to disk

Even worse – the evicted page is the LRU one – probably garbage!

Worst case: disk bandwidth = 2 × allocation rate (each allocated page is both fetched in and written back)

Page 24: Caching Considerations for Generational Garbage Collection

Another view

View GC allocator as a co-process to the mutator

Each one has its own locality of reference

The mutator probably has good spatial locality

The allocator linearly marches through memory

Allocation is cyclic (remember LRU)

Page 25: Caching Considerations for Generational Garbage Collection

Compaction and Semi-Spaces

Compaction helps the mutator, but makes little difference to the allocator

It still marches through large memory areas

Trouble with semi-spaces – the tospace was probably evicted. All addresses are replaced – a cache flush. The entire heap is marched through every second cycle

Page 26: Caching Considerations for Generational Garbage Collection

Solution?

So LRU is bad – can we replace it?

We can, but it won’t help much

Too much memory is touched too frequently

Allocator page faults dominate program execution!

Only holding the entire reuse cycle in memory will stop paging

Page 27: Caching Considerations for Generational Garbage Collection

Generational GC

Solution: touch less memory, less frequently

Divide the heap into generations

GC the young generation(s) – touching less memory

This eliminates the vast memory marching – the memory reuse cycle is minimized

Paging is eliminated – but what about the cache?

Page 28: Caching Considerations for Generational Garbage Collection

Generational GC variations

Can use a single space – immediate promotion

Can use semi-spaces – promote at will, at the expense of more memory

Page 29: Caching Considerations for Generational Garbage Collection

Better Generational GC

Ungar: Use a pair of semi-spaces and a separate dedicated creation space

This space is emptied and reused every cycle, while the semi-spaces alternate roles as the destination

The result: only a small part of the semi-spaces is touched, and new objects are created in a “hot” space in main memory (and maybe in the cache)
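Ungar’s scheme can be sketched as a toy model (objects are just ids and liveness is supplied externally; this is an illustration, not the article’s implementation):

```python
class UngarYoungGen:
    """Toy model of Ungar's young generation: a creation space that is
    emptied every cycle, plus two semi-spaces that alternate as the
    destination for survivors."""

    def __init__(self):
        self.creation = []                 # all new objects are allocated here
        self.from_space, self.to_space = [], []

    def allocate(self, obj):
        self.creation.append(obj)

    def collect(self, live):
        """Copy survivors of creation + from-space into to-space, empty
        the creation space, and swap the semi-spaces' roles."""
        self.to_space.extend(o for o in self.creation + self.from_space
                             if o in live)
        self.creation = []                 # reused ("hot") every cycle
        self.from_space, self.to_space = self.to_space, []

gen = UngarYoungGen()
for obj in (1, 2, 3):
    gen.allocate(obj)
gen.collect(live={2})
assert gen.creation == [] and gen.from_space == [2]
```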

Page 30: Caching Considerations for Generational Garbage Collection

Cache Revised

Cache misses can be categorized into:

1. Capacity misses – no matter what cache is used, the miss will occur
2. Conflict misses – a miss occurs because two (or more) addresses map to the same cache line (set)

Direct-mapped caches suffer from conflict misses the most – every miss evicts the block with the same mapping

Page 31: Caching Considerations for Generational Garbage Collection

Conflict Misses in-depth

The miss-rate function is roughly a minimization: the rate is set by the less frequently accessed address

Example: both addresses map to the same line. The first is accessed every ms, the second every μs. The (double) miss occurs every ms

The rate depends on the usage frequency of the addresses not in the cache
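The interference between two colliding addresses can be reproduced with a tiny trace simulation (a self-contained sketch with arbitrary cache sizes; `misses` counts misses on an LRU set-associative cache):

```python
def misses(trace, num_sets, ways, line_size=16):
    """Count cache misses for an address trace (LRU within each set)."""
    sets = [[] for _ in range(num_sets)]
    count = 0
    for address in trace:
        block = address // line_size
        s, tag = sets[block % num_sets], block // num_sets
        if tag in s:
            s.remove(tag)          # hit: refresh LRU position
        else:
            count += 1
            if len(s) == ways:
                s.pop(0)           # evict least recently used
        s.append(tag)
    return count

# Two addresses that map to the same place, accessed alternately:
trace = [0x0000, 0x4000] * 100
assert misses(trace, num_sets=1024, ways=1) == 200  # direct mapped: every access misses
assert misses(trace, num_sets=512, ways=2) == 2     # 2-way: both fit in one set
```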

Page 32: Caching Considerations for Generational Garbage Collection

Minimizing Conflict Misses Rate

Most non-GC systems are skewed – a few objects are accessed frequently, the rest rarely. If the frequent ones are placed well, the cache is efficient

If many blocks are accessed on an intermediate time scale, there are more misses, and more chances that they will interfere with each other

This is over-simplified, to aid understanding

Page 33: Caching Considerations for Generational Garbage Collection

Example

A program marches through memory while doing normal activity. We use a 16 KB cache

2-way associative: the most frequent blocks are not evicted

Direct mapped: a total flush every cycle

Conclusion: it takes twice the time for a block to be remapped in DM, but the result is painful (a flush)

DM can’t handle multiple access patterns

Page 34: Caching Considerations for Generational Garbage Collection

Experiments

An instrumented Scheme compiler with an integrated cache simulator

Executes millions of instructions, allocates MBs

We’ll present 2 programs:

1. Scheme compiler
2. Boyer benchmark – objects live long and tend to be promoted

Page 35: Caching Considerations for Generational Garbage Collection

Experiments contd.

Cache lines are 16 bytes wide

3 collectors:

1. GGC with 2 MB spaces for the generation – no promotion is ever done
2. GGC with 141 KB spaces for the generation
3. #2 + a 141 KB creation space (Ungar)

Page 36: Caching Considerations for Generational Garbage Collection

Results (Capacity)

Page 37: Caching Considerations for Generational Garbage Collection

Interpretation

LRU queue distance distribution. What does it mean?

1. The probability of a block being touched at different points in the LRU queue
2. The probability of a block being touched, given how long ago it was last touched
3. The probability of a block being touched, given how many other blocks have been touched more recently
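This distribution can be computed from a block-access trace with the classic LRU stack-distance algorithm (a minimal sketch):

```python
from collections import Counter

def lru_distances(trace):
    """Histogram of LRU-queue positions: for each access, how many
    distinct blocks were touched since this block was last touched.
    First-time accesses are counted under 'cold'."""
    stack, hist = [], Counter()    # stack: most recently used block last
    for block in trace:
        if block in stack:
            hist[len(stack) - 1 - stack.index(block)] += 1  # 0 = most recent
            stack.remove(block)
        else:
            hist["cold"] += 1
        stack.append(block)
    return hist

# A cyclic sweep over 3 blocks always returns at distance 2 -- a cache
# holding fewer than 3 blocks misses on every access, which is exactly
# the "reuse cycle too long to be captured" effect:
h = lru_distances([1, 2, 3] * 4)
assert h["cold"] == 3 and h[2] == 9
```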

Page 38: Caching Considerations for Generational Garbage Collection

Interpretation contd.

The fourth queue position corresponds to 128 KB; the eighth, to 256 KB

For any given position, the area under the curve to its left represents cache hits; to its right, misses

The curve’s height at a point is the marginal increase in hits from enlarging the cache at that point

Page 39: Caching Considerations for Generational Garbage Collection

Experiment Meaning

The first entries absorb most hits

Collector #1: a dramatic drop; beyond about the tenth position (320 KB) no more cache is needed

Collectors #2+#3: a hump, peaking when memory starts recycling

Page 40: Caching Considerations for Generational Garbage Collection

Experiment Meaning contd.

#2: recycling starts after 141 × 2 KB; a cache of 300-400 KB should suffice

#3: the creation space is constantly recycled, and only a small part of the other spaces is touched; a cache of 200-300 KB should suffice

Page 41: Caching Considerations for Generational Garbage Collection

Experiment Meaning contd. 2

Boyer behaves differently

#3 is better than #2 by 30%

Capacity misses disappear if the cache is larger than the youngest generation

Page 42: Caching Considerations for Generational Garbage Collection

Results (Collision)

Page 43: Caching Considerations for Generational Garbage Collection

Interpretation

The graph plots cache size vs. miss rate

Shows results only for collector #3

Page 44: Caching Considerations for Generational Garbage Collection

Experiment Meaning

The associative cache shows a dramatic, almost linear drop up to 256 KB (which contains the youngest generation). Beyond that, nothing interesting

Direct mapped is the same on the 16-90 KB interval, better on 90-135 KB, and much worse later on

Page 45: Caching Considerations for Generational Garbage Collection

Experiment Meaning contd.

Why is DM better in that interval?

The cache is big enough to hold the creation area, but suffers interference from other blocks

The associative cache evicts blocks before they are used, due to collisions

Later, the associative cache suffers only “re-fill” misses, while DM also suffers collisions

Page 46: Caching Considerations for Generational Garbage Collection

More Performance Notes

When the cache is too small, most evicted blocks are dirty and require expensive writebacks

Interference may also cause writebacks

Page 47: Caching Considerations for Generational Garbage Collection

Conclusions

Caches are an important part of modern computers

Garbage collectors reuse memory in cycles, and often march through memory

LRU evicts dirty pages/cache lines; needless fetches are costly

GGC reuses smaller area, reduces paging

Page 48: Caching Considerations for Generational Garbage Collection

Conclusions contd.

A similar idea applies to caches: hold the youngest generation entirely

Ungar’s 3-space proposal reduces the required footprint by 30%

Excluding a small interval, associative caches perform better than direct-mapped ones, which suffer collision misses

Page 49: Caching Considerations for Generational Garbage Collection

Questions?

Page 50: Caching Considerations for Generational Garbage Collection

The End