Comparing GCs and Allocation
Richard Jones, Antony Hosking and Eliot Moss 2012
Presented by Yarden Marton
18.11.14
• Comparing between different garbage collectors.
• Allocation – methods and considerations.
Outline
Comparing GCs
• What is the best GC?
• When we say “best”, do we mean:
- Best throughput?
- Shortest pause times?
- Good space utilization?
- Some compromise between these?
Comparing GCs
• More to consider:
- Application dependency
- Heap space availability
- Heap size
• Throughput
• Pause time
• Space
• Implementation
Comparing GCs - Aspects
• Primary goal for ‘batch’ applications or for systems experiencing delays.
• Does a faster collector mean a faster application? Not necessarily.
– Mutators pay part of the cost
Throughput
Throughput
• Algorithmic complexity
• Mark-sweep:
- Pays the cost of both the tracing and sweeping phases
- Sweeping requires visiting every object
• Copying:
- Pays the cost of the tracing phase only
- Requires visiting only live objects
Throughput
• Is copying collection faster?
• Not necessarily, because of:
- The number of instructions executed to visit an object
- Locality
- Lazy sweeping
Pause Time
• Important for interactive applications, transaction processors and more.
• ‘Stop-the-world’ collectors halt all mutators during collection.
• This makes reference counting immediately attractive.
• However:
- Recursive deletion of reference counts is costly
- Both improvements to reference counting (deferral and coalescing) reintroduce a stop-the-world pause
Space
• Important for:
- Systems with tight physical constraints on memory
- Very large applications
• All collectors incur space overhead:
- Reference count fields
- Additional heap space
- Heap fragmentation
- Auxiliary data structures
- Room for garbage
Space
• Completeness – reclaiming all dead objects eventually.
- Basic reference counting is incomplete (it cannot reclaim cycles)
• Promptness – reclaiming all dead objects at each collection cycle.
- Basic tracing collectors are prompt, but at a cost
• Modern high-performance collectors typically trade immediacy for performance.
Implementation
• GC algorithms are difficult to implement, especially concurrent algorithms.
• Errors can manifest themselves long afterwards.
• Tracing:
- Advantage: simple collector–mutator interface
- Disadvantage: determining the roots is complicated
• Reference counting:
- Advantage: can be implemented in a library
- Disadvantage: processing overheads, and correctness depends on every reference-count manipulation being performed
• In general, copying and compacting collectors are more complex than non-moving collectors.
Adaptive Systems
• Commercial systems often offer a choice between GCs, with a large number of tuning options.
• Researchers have developed systems that adapt to the environment:
- A Java run-time (Soman et al [2004])
- Singer et al [2007a]
- Sun’s Ergonomic tuning
Advice For Developers
• Know your application:
- Measure its behavior
- Track the size and lifetime distributions of the objects it uses
• Experiment with the different collector configurations on offer.
• So far, we have considered two styles of collection:
– Direct, reference counting.
– Indirect, tracing collection.
• Next: An abstract framework for a wide variety of collectors.
A Unified Theory of GC
• GC can be expressed as a fixed-point computation that assigns reference counts ρ(n) to nodes n ∈ Nodes:

∀ref ∈ Nodes:
ρ(ref) = |{fld ∈ Roots : *fld = ref}| + |{fld ∈ Pointers(n) : n ∈ Nodes ∧ ρ(n) > 0 ∧ *fld = ref}|
• Nodes with non-zero count are retained and the rest should be reclaimed.
• Use of abstract data structures whose implementations can vary.
• W – a work list of objects to be processed. When empty, the algorithms terminate.
Abstract GC
atomic collectTracing():
    rootsTracing(W)      // find root objects
    scanTracing(W)       // mark reachable objects
    sweepTracing()       // free dead objects

rootsTracing(R):
    for each fld in Roots
        ref ← *fld
        if ref ≠ null
            R ← R + [ref]

scanTracing(W):
    while not isEmpty(W)
        src ← remove(W)
        ρ(src) ← ρ(src) + 1
        if ρ(src) = 1
            for each fld in Pointers(src)
                ref ← *fld
                if ref ≠ null
                    W ← W + [ref]
Abstract Tracing GC Algorithm
sweepTracing():
    for each node in Nodes
        if ρ(node) = 0
            free(node)
        else
            ρ(node) ← 0

New():
    ref ← allocate()
    if ref = null
        collectTracing()
        ref ← allocate()
        if ref = null
            error “Out of memory”
    ρ(ref) ← 0
    return ref
Abstract Tracing GC Algorithm (Continued)
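The abstract tracing collector above can be made concrete. Below is a minimal Python sketch: the heap is modeled as a dict from node name to its outgoing pointers, and the node names and example graph are illustrative assumptions, not taken from the slides.

```python
# Minimal model of the abstract tracing collector.
# The heap: a dict mapping each node to the nodes its fields point to.
def collect_tracing(nodes, roots, pointers):
    rho = {n: 0 for n in nodes}        # all counts are 0 between collections
    work = list(roots)                 # rootsTracing: W starts with the roots
    while work:                        # scanTracing
        src = work.pop()
        rho[src] += 1
        if rho[src] == 1:              # first visit: scan its fields
            work.extend(pointers.get(src, []))
    # sweepTracing: nodes whose count stayed 0 are unreachable
    live = {n for n in nodes if rho[n] > 0}
    return live, rho

# Illustrative heap: the root refers to B; D is unreachable.
nodes = {"A", "B", "C", "D"}
pointers = {"B": ["A", "C"], "C": ["B"], "D": ["A"]}
live, rho = collect_tracing(nodes, {"B"}, pointers)
assert live == {"A", "B", "C"}
```

Note that ρ ends up matching the fixed-point definition: each live object's count equals the number of references to it from the roots and from other live objects.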
[Slide animation: the abstract tracing algorithm on a four-object heap A, B, C, D, with the roots referring to B and C and with D unreachable. rootsTracing puts B and C on W; scanTracing sets ρ(B)=1, ρ(C)=1, ρ(A)=1 and finally ρ(B)=2 as the objects are visited; sweepTracing frees D and resets every count to 0.]
atomic collectCounting(I, D):
    applyIncrements(I)   // apply the buffered increments to ρ
    scanCounting(D)      // apply the buffered decrements, recursively
    sweepCounting()      // free dead objects

applyIncrements(I):
    while not isEmpty(I)
        ref ← remove(I)
        ρ(ref) ← ρ(ref) + 1

scanCounting(W):
    while not isEmpty(W)
        src ← remove(W)
        ρ(src) ← ρ(src) - 1
        if ρ(src) = 0
            for each fld in Pointers(src)
                ref ← *fld
                if ref ≠ null
                    W ← W + [ref]
Abstract reference counting GC Algorithm
sweepCounting():
    for each node in Nodes
        if ρ(node) = 0
            free(node)

New():
    ref ← allocate()
    if ref = null
        collectCounting(I, D)
        ref ← allocate()
        if ref = null
            error “Out of memory”
    ρ(ref) ← 0
    return ref
Abstract reference counting GC Algorithm (Continued)
inc(ref):
    if ref ≠ null
        I ← I + [ref]

dec(ref):
    if ref ≠ null
        D ← D + [ref]

atomic Write(src, i, dst):
    inc(dst)
    dec(src[i])
    src[i] ← dst
Abstract reference counting GC Algorithm (Continued)
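A corresponding Python sketch of the buffered reference-counting collector. The increment buffer I and decrement buffer D are filled by hand here to stand in for the Write barrier; the heap and buffer contents are illustrative assumptions.

```python
# Minimal model of the abstract (buffered) reference-counting collector.
# I buffers increments and D buffers decrements recorded by the Write barrier.
def collect_counting(rho, I, D, pointers):
    while I:                           # applyIncrements
        rho[I.pop()] += 1
    while D:                           # scanCounting: decrement, recursing
        ref = D.pop()                  # into children of objects that die
        rho[ref] -= 1
        if rho[ref] == 0:
            D.extend(pointers.get(ref, []))
    return {n for n, c in rho.items() if c == 0}   # sweepCounting frees these

# Illustrative heap: A -> B -> C, plus a root reference to B.
rho = {"A": 0, "B": 0, "C": 0}
pointers = {"A": ["B"], "B": ["C"]}
I = ["A", "B", "B", "C"]    # buffered increments (one per referrer)
D = ["A"]                   # the reference to A was overwritten
dead = collect_counting(rho, I, D, pointers)
assert dead == {"A"}        # A dies; B survives via its remaining reference
```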
[Slide animation: the abstract reference-counting algorithm on the same heap. The increment buffer I holds the targets of recorded pointer writes and the decrement buffer D holds the overwritten targets. applyIncrements raises the counts to ρ(A)=2, ρ(B)=3, ρ(C)=1, ρ(D)=1; scanCounting applies the buffered decrements, recursively decrementing the children of objects whose count drops to 0 (here D); sweepCounting then frees the objects whose count reached 0.]
atomic collectDrc(I, D):
    rootsTracing(I)      // add root objects to I
    applyIncrements(I)   // apply the buffered increments to ρ
    scanCounting(D)      // apply the buffered decrements, recursively
    sweepCounting()      // free dead objects
    rootsTracing(D)      // keep the invariant:
    applyDecrements(D)   // remove the root contributions again

New():
    ref ← allocate()
    if ref = null
        collectDrc(I, D)
        ref ← allocate()
        if ref = null
            error “Out of memory”
    ρ(ref) ← 0
    return ref
Abstract deferred reference counting GC Algorithm
atomic Write(src, i, dst):
    if src ≠ Roots
        inc(dst)
        dec(src[i])
    src[i] ← dst

applyDecrements(D):
    while not isEmpty(D)
        ref ← remove(D)
        ρ(ref) ← ρ(ref) - 1
Abstract deferred reference counting GC Algorithm (Continued)
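The deferred scheme can be sketched the same way. In this Python model, heap writes buffer inc/dec as before, root writes are not counted at all, and the roots' contribution is added and then removed around each collection; the scenario at the bottom is an illustrative assumption.

```python
# Minimal model of abstract deferred reference counting.
def collect_drc(rho, I, D, roots, pointers):
    I.extend(roots)                    # rootsTracing(I)
    while I:                           # applyIncrements
        rho[I.pop()] += 1
    while D:                           # scanCounting
        ref = D.pop()
        rho[ref] -= 1
        if rho[ref] == 0:
            D.extend(pointers.get(ref, []))
    dead = {n for n, c in rho.items() if c == 0}   # sweepCounting
    D.extend(roots)                    # rootsTracing(D): keep the invariant
    while D:                           # applyDecrements (not recursive)
        rho[D.pop()] -= 1
    return dead

# Illustrative scenario: B is a root; a heap reference to C was created
# (buffered in I) and later overwritten (buffered in D).
rho = {"B": 0, "C": 0}
dead = collect_drc(rho, I=["C"], D=["C"], roots=["B"], pointers={})
assert dead == {"C"}
assert rho == {"B": 0, "C": 0}   # root contributions are removed again
```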
[Slide animation: the abstract deferred reference-counting algorithm on the same heap. rootsTracing(I) temporarily buffers increments for the root-referenced objects; applyIncrements raises the counts; scanCounting applies the buffered decrements; sweepCounting frees D. Finally rootsTracing(D) and applyDecrements remove the temporary root contributions, restoring the invariant that ρ does not count references from the roots.]
Comparing GCs Summary
• GC performance depends on many aspects.
- Therefore, no collector has an absolute advantage over the others.
• Garbage collection can be expressed in an abstract way.
- This highlights the similarities and differences between collectors.
Allocation
• Three aspects to memory management:
- Allocation of memory in the first place
- Identification of live data
- Reclamation for future use
• Allocation and reclamation of memory are tightly linked.
• Several key differences between automatic and explicit memory management, in terms of allocating and freeing:
- GC frees space all at once
- A system with GC has more information when allocating
- With GC, users tend to write programs in a different style
• Uses a large free chunk of memory
• Given a request for n bytes, it allocates that much from one end of the free chunk.
sequentialAllocate(n):
    result ← free
    newFree ← result + n
    if newFree > limit
        return null
    free ← newFree
    return result
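A minimal executable sketch of sequentialAllocate, with addresses modeled as plain integers:

```python
# Bump-pointer (sequential) allocation over a region [free, limit).
class SequentialAllocator:
    def __init__(self, start, size):
        self.free = start
        self.limit = start + size

    def allocate(self, n):
        result = self.free
        new_free = result + n
        if new_free > self.limit:
            return None          # exhausted: only collection can recover space
        self.free = new_free
        return result

a = SequentialAllocator(0, 100)
assert a.allocate(60) == 0       # first object at the start of the chunk
assert a.allocate(40) == 60      # next object immediately after it
assert a.allocate(1) is None     # the chunk is now full
```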
Sequential Allocation
[Figure: sequential allocation. The region between free and limit is available; a request for n bytes advances free past the new object (plus any alignment padding), and result points at the allocated space.]
• Properties:
– Simple
– Efficient
– Better cache locality
– May be less suitable for non-moving collectors
Sequential Allocation
• A data structure records the location and size of free cells of memory.
• The allocator considers each free cell in turn, and according to some policy, chooses one to allocate.
• Three basic types of free-list allocation:
– First-fit
– Next-fit
– Best-fit
Free-list Allocation
First-fit Allocation
• Use the first cell that can satisfy the allocation request.
• A split of the cell may occur unless the remainder is too small.
firstFitAllocate(n):
    prev ← addressOf(head)
    loop
        curr ← next(prev)
        if curr = null
            return null
        else if size(curr) < n
            prev ← curr
        else
            return listAllocate(prev, curr, n)
listAllocate(prev, curr, n):
    result ← curr
    if shouldSplit(size(curr), n)
        remainder ← result + n
        next(remainder) ← next(curr)
        size(remainder) ← size(curr) - n
        next(prev) ← remainder
    else
        next(prev) ← next(curr)
    return result

listAllocateAlt(prev, curr, n):
    if shouldSplit(size(curr), n)
        size(curr) ← size(curr) - n
        result ← curr + size(curr)
    else
        next(prev) ← next(curr)
        result ← curr
    return result
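The first-fit policy with splitting can be sketched in Python. Here the free list is modeled as a Python list of (address, size) pairs rather than the threaded next/size fields of the pseudocode, and MIN_CELL is an assumed minimum useful remainder size.

```python
# First-fit over a free list of (address, size) cells. A cell is split
# when the remainder is at least MIN_CELL (an assumed threshold).
MIN_CELL = 16

def first_fit_allocate(free_list, n):
    for i, (addr, size) in enumerate(free_list):
        if size < n:
            continue                             # too small: keep scanning
        if size - n >= MIN_CELL:                 # shouldSplit
            free_list[i] = (addr + n, size - n)  # keep the remainder
        else:
            del free_list[i]                     # hand out the whole cell
        return addr
    return None                                  # no cell can satisfy n

cells = [(0, 150), (300, 100), (500, 170)]
assert first_fit_allocate(cells, 120) == 0       # the 150-unit cell is split
assert cells[0] == (120, 30)                     # a 30-unit remainder stays
```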
First-fit Allocation
First-fit example (the cells of the free list, in list order):
Initial: 150KB 100KB 170KB 300KB 50KB
After a 120KB request: 30KB 100KB 170KB 300KB 50KB
After a 50KB request: 30KB 50KB 170KB 300KB 50KB
After a 200KB request: 30KB 50KB 170KB 100KB 50KB
• Small remainder cells accumulate near the front of the list, slowing down allocation.
• In terms of space utilization, may behave similarly to best-fit.
• An issue is where in the list to enter a newly freed cell
• It is usually more natural to build the list in address order, like mark-sweep does.
First-fit Allocation
• A variation of first-fit
• Method - start the search for a cell of suitable size from the point in the list where the last search succeeded.
• When reaching the end of list, start over from the beginning.
• Idea - reduce the need to iterate repeatedly past the small cells at the head of the list.
• Drawbacks: – Fragmentation
– Poor locality on accessing the list
– Poor locality of the allocated objects
Next-fit Allocation
nextFitAllocate(n):
    start ← prev
    loop
        curr ← next(prev)
        if curr = null
            prev ← addressOf(head)   // wrap around to the list head
            curr ← next(prev)
        if prev = start
            return null
        else if size(curr) < n
            prev ← curr
        else
            return listAllocate(prev, curr, n)
Next-fit Allocation Algorithm
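The next-fit policy can be sketched in Python with the roving pointer modeled as a list index returned to the caller; the cells are (address, size) pairs and the remainder-size check is omitted for brevity.

```python
# Next-fit: start scanning where the previous search succeeded, wrapping
# around at the end of the list.
def next_fit_allocate(free_list, n, start):
    for k in range(len(free_list)):
        i = (start + k) % len(free_list)
        addr, size = free_list[i]
        if size >= n:
            free_list[i] = (addr + n, size - n)  # split the cell
            return addr, i                       # i resumes the next search
    return None, start                           # full wrap: no fit

cells = [(0, 30), (100, 100), (300, 170)]
addr, pos = next_fit_allocate(cells, 50, start=1)   # skip the small head cell
assert addr == 100
addr, pos = next_fit_allocate(cells, 60, start=pos) # resume at position 1
assert addr == 300
```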
Next-fit example (the cells of the free list, in list order):
Initial: 150KB 100KB 170KB 300KB 50KB
After a 120KB request: 30KB 100KB 170KB 300KB 50KB
After a 20KB request: 30KB 80KB 170KB 300KB 50KB
After a 50KB request: 30KB 80KB 120KB 300KB 50KB
• Method - find the cell whose size most closely matches the allocation request.
• Idea:
– Minimize waste
– Avoid splitting large cells unnecessarily
• Bad worst case
Best-fit Allocation
bestFitAllocate(n):
    best ← null
    bestSize ← ∞
    prev ← addressOf(head)
    loop
        curr ← next(prev)
        if curr = null || size(curr) = n     // exact fit, or end of list
            if curr ≠ null
                bestPrev ← prev
                best ← curr
            else if best = null
                return null
            return listAllocate(bestPrev, best, n)
        else if size(curr) < n || bestSize < size(curr)
            prev ← curr                      // too small, or no improvement
        else
            best ← curr
            bestPrev ← prev
            bestSize ← size(curr)
Best-fit Allocation Algorithm
Best-fit example (the cells of the free list, in list order):
Initial: 150KB 100KB 170KB 300KB 50KB
After a 90KB request: 150KB 10KB 170KB 300KB 50KB
After a 50KB request: 150KB 10KB 170KB 300KB (the exact-fit 50KB cell is used)
After a 100KB request: 50KB 10KB 170KB 300KB
• Use a balanced binary tree.
• Sorted by size (for best-fit) or by address (for first-fit or next-fit).
• If sorted by size, only one cell of each size need be entered.
• Example: Cartesian tree for first/next-fit.
– Indexed by address (primary key) and size (secondary key)
– Total order by address
– Organized as a heap for the sizes
Speeding Free-list Allocation
• Searching in the Cartesian tree under first-fits policy:
firstFitAllocateCartesian(n):
    parent ← null
    curr ← root
    loop
        if left(curr) ≠ null && max(left(curr)) ≥ n
            parent ← curr
            curr ← left(curr)
        else if prev < curr && size(curr) ≥ n
            prev ← curr
            return treeAllocate(curr, parent, n)
        else if right(curr) ≠ null && max(right(curr)) ≥ n
            parent ← curr
            curr ← right(curr)
        else
            return null
Speeding Free-list Allocation
• Dispersal of free memory across a possibly large number of small free cells.
• Negative effects:
- Can prevent allocation from succeeding
- May cause a program to use more address space, more resident pages and more cache lines
• Fragmentation is impractical to avoid:
- Usually the allocator cannot know what the future request sequence will be
- Even given a known request sequence, computing an optimal allocation is NP-hard
• Usually there is a trade-off between allocation speed and fragmentation.
Fragmentation
• Idea – use multiple free-lists whose members are segregated by size, in order to speed allocation.
• Usually a fixed number k of size values s0 < s1 < … < sk−1.
• k+1 free-lists f0, …, fk.
• For a free cell b on list fi: size(b) = si for 0 ≤ i ≤ k−1, and size(b) > sk−1 for i = k.
• When requesting a cell of size b ≤ sk−1, the allocator rounds the request size up to the smallest si such that b ≤ si.
• si is called a size class.
Segregated-fits Allocation
segregatedFitAllocate(j):
    result ← remove(freeLists[j])
    if result = null
        large ← allocateBlock()
        if large = null
            return null
        initialize(large, sizes[j])
        result ← remove(freeLists[j])
    return result
• List fk, for cells larger than sk−1, is organized using one of the basic single-list algorithms.
• Per-cell overheads for large cells are a bit higher, but in total this is negligible.
• The main advantage: for every size class other than the last, allocation typically requires only constant time.
Segregated-fits Allocation
[Figure: free-lists f0 … fk−1, each holding cells of a single size s0 … sk−1, plus list fk holding cells larger than sk−1.]
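A Python sketch of segregated-fits allocation; SIZE_CLASSES and BLOCK are illustrative choices, and fresh blocks are carved from a simple bump pointer.

```python
import bisect

# Segregated fits: round each request up to its size class and pop a cell
# from that class's free list; when a list is empty, refill it by slicing
# a fresh block into cells of that size.
SIZE_CLASSES = [16, 32, 64, 128]   # s0 < s1 < ... < s(k-1)
BLOCK = 1024

free_lists = {s: [] for s in SIZE_CLASSES}
next_block = 0                     # bump pointer handing out fresh blocks

def segregated_fit_allocate(n):
    global next_block
    j = bisect.bisect_left(SIZE_CLASSES, n)
    if j == len(SIZE_CLASSES):
        return None                # larger than s(k-1): use the general list
    s = SIZE_CLASSES[j]
    if not free_lists[s]:          # initialize(large, sizes[j])
        base, next_block = next_block, next_block + BLOCK
        free_lists[s] = [base + off for off in range(0, BLOCK, s)]
    return free_lists[s].pop()     # constant-time in the common case

a = segregated_fit_allocate(20)    # rounded up to the 32-byte class
b = segregated_fit_allocate(30)    # same class, distinct cell
assert a != b and a % 32 == 0 and b % 32 == 0
```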
• With simple free-list allocators – free cells that are too small to satisfy a request. Called external fragmentation.
• With segregated-fits allocation – wasted space inside an individual cell because the requested size was rounded up. Called internal fragmentation.
More on Fragmentation
• Important consideration – how to populate each free-list of segregated-fits.
• Two approaches:
– Dedicating whole blocks to particular sizes
– Splitting
Populating size classes
• Choose some block size B, a power of two.
• The allocator is provided with blocks.
• If the request is larger than one block, multiple contiguous blocks are allocated.
• For a size class s < B, we populate the free-list fs by allocating a block and immediately slice it into cells of size s.
• Metadata of the cells is stored on the block.
Big Bag of Pages Block-based allocation
• Disadvantage:
– Fragmentation, average waste of half a block (worst case (B-s)/B).
• Advantages:
– Reduced per-cell metadata
– Simple and efficient for the common case
Big Bag of Pages Block-based allocation
• Like simple free-list schemes, split a cell if that is the only way to satisfy a request.
• Improvement: concatenate the remaining portion to a suitable free-list (if possible).
• For example – the buddy system:
– Size classes are powers of two
– Can split a cell of size 2^(i+1) into two cells of size 2^i
– Can combine in the opposite direction (only if the two small cells were split from the same large cell)
Splitting
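A Python sketch of the buddy system described above. The address-XOR trick for locating a cell's buddy relies on cells being aligned to their own size within the area; the sizes below are illustrative.

```python
# Buddy-system sketch: sizes are powers of two; splitting a 2^(i+1) cell
# yields two buddies of size 2^i, and a freed cell recombines with its
# buddy when that buddy is also free. Buddies differ in exactly one
# address bit, so buddy = addr XOR size.
MIN, MAX = 16 * 1024, 128 * 1024
free_lists = {16 * 1024: [], 32 * 1024: [], 64 * 1024: [], 128 * 1024: [0]}

def buddy_allocate(n):
    size = MIN
    while size < n:                        # round up to a power-of-two class
        size *= 2
    s = size
    while s <= MAX and not free_lists[s]:  # find a cell large enough to split
        s *= 2
    if s > MAX:
        return None
    addr = free_lists[s].pop()
    while s > size:                        # split down to the requested class
        s //= 2
        free_lists[s].append(addr + s)     # the upper buddy stays free
    return addr

def buddy_free(addr, size):
    buddy = addr ^ size
    if size < MAX and buddy in free_lists[size]:
        free_lists[size].remove(buddy)     # coalesce with the free buddy
        buddy_free(min(addr, buddy), size * 2)
    else:
        free_lists[size].append(addr)

a = buddy_allocate(20 * 1024)              # rounded up to 32KB
assert a == 0
buddy_free(a, 32 * 1024)                   # all the splits coalesce back
assert free_lists[MAX] == [0]
```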
[Slide animation: the buddy system on a 128KB area with 16KB minimum and 128KB maximum cell size. A 20KB request splits 128KB into two 64KB buddies and one of those into two 32KB buddies; a later 10KB request splits a 32KB cell into two 16KB buddies. As cells are freed, buddies recombine pairwise until the single 128KB cell is restored.]
• Alignment
• Size constraints
• Boundary tags
• Heap parsability
• Locality
Allocation’s Additional Considerations
• Allocated objects may require special alignment
• For example: a double-word floating-point value
– Making the granule a double-word would be wasteful
– In Java, an array header takes 3 words – one word is wasted or skipped
Alignment
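Alignment is usually handled by rounding the allocation pointer up to the next boundary; for a power-of-two alignment this is a standard bit trick:

```python
# Rounding an address up to an alignment boundary (alignment must be a
# power of two): add alignment-1, then clear the low bits.
def align_up(addr, alignment):
    return (addr + alignment - 1) & ~(alignment - 1)

assert align_up(12, 8) == 16   # pad 4 bytes up to a double-word boundary
assert align_up(16, 8) == 16   # already aligned: no padding
```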
• Some collection schemes require a minimum amount of space in each cell.
– Forwarding address
– Lock/status
• In that case, the allocator will allocate more words than requested.
Size Constraints
• Additional header or boundary tag associated with each cell.
• Found outside the storage available to the program.
• Indicates size and allocated/free status
• Is one or two words long
• Possible use of bitmap instead
Boundary Tags
• The ability to advance from cell to cell through the heap.
• An object’s header (one or two words):
– Type
– Hash code
– Synchronization information
– Mark bit
• The header comes before the data
• The reference refers to the first element/field
Heap Parsability
• How to handle the gaps left by alignment?
– Zero all free space in advance
– Devise a distinct range of values to write at the start of each gap
• Parsing is easier with a bitmap indicating where each object starts.
– This requires additional space and time
Heap Parsability
• During allocation
– Address-ordered free-lists and sequential allocation exhibit good locality.
• During freeing
– Goal: objects that are freed together should be near each other.
– Empirically, objects allocated at the same time often become unreachable at about the same time.
Locality
• Multiple threads allocating concurrently.
• Most steps in allocation need to be atomic.
• This can become a bottleneck.
• Basic solution – give each thread its own allocation area.
• A global pool hands out chunks to threads as needed.
Allocation in Concurrent Systems
Allocation Summary
• Methods:
- Sequential
- Free-list: first-fit, next-fit and best-fit
- Segregated-fits
• Various considerations to keep in mind.