Comparing GCs and Allocation
Richard Jones, Antony Hosking and Eliot Moss 2012
Presented by Yarden Marton
18.11.14
• Comparing between different garbage collectors.
• Allocation – methods and considerations.
Outline
Comparing GCs
• What is the best GC?
• When we say “best”, do we mean:
- Best throughput?
- Shortest pause times?
- Good space utilization?
- Some compromise between these?
Comparing GCs
• More to consider:
- Application dependency
- Heap space availability
- Heap size
• Throughput
• Pause time
• Space
• Implementation
Comparing GCs - Aspects
• Primary goal for ‘batch’ applications or for systems experiencing delays.
• Does a faster collector mean a faster application? Not necessarily.
– Mutators pay part of the cost
Throughput
Throughput
• Algorithmic complexity
• Mark-sweep:
- Pays the cost of both the tracing and sweeping phases
- Sweeping requires visiting every object
• Copying:
- Pays the cost of the tracing phase only
- Requires visiting only live objects
Throughput
• Is copying collection faster?
• Not necessarily, because of:
- The number of instructions executed to visit an object
- Locality
- Lazy sweeping
Pause Time
• Important for interactive applications, transaction processors and more.
• ‘Stop-the-world’ collectors halt all mutators during collection.
• This makes reference counting immediately attractive.
• However:
- Recursive deletion of reference counts is costly
- Both improvements to reference counting (deferral and coalescing) reintroduce a stop-the-world pause
Space
• Important for:
- Systems with tight physical constraints on memory
- Very large applications
• All collectors incur space overhead:
- Reference count fields
- Additional heap space
- Heap fragmentation
- Auxiliary data structures
- Room for garbage
Space
• Completeness – reclaiming all dead objects eventually.
- Basic reference counting is incomplete (it cannot reclaim cycles)
• Promptness – reclaiming all dead objects at each collection cycle.
- Basic tracing collectors are prompt, but at a cost
• Modern high-performance collectors typically trade immediacy for performance.
Implementation
• GC algorithms are difficult to implement, especially concurrent algorithms.
• Errors can manifest themselves long afterwards.
• Tracing:
- Advantage: simple collector–mutator interface
- Disadvantage: determining the roots is complicated
• Reference counting:
- Advantage: can be implemented in a library
- Disadvantage: processing overheads, and correctness depends on every reference-count manipulation being performed
• In general, copying and compacting collectors are more complex than non-moving collectors.
Adaptive Systems
• Commercial systems often offer a choice between GCs, with a large number of tuning options.
• Researchers have developed systems that adapt to the environment:
- A Java run-time (Soman et al [2004])
- Singer et al [2007a]
- Sun’s Ergonomic tuning
Advice For Developers
• Know your application:
- Measure its behavior
- Track the size and lifetime distributions of the objects it uses
• Experiment with the different collector configurations on offer.
• So far, we have considered two styles of collection:
– Direct, reference counting.
– Indirect, tracing collection.
• Next: An abstract framework for a wide variety of collectors.
A Unified Theory of GC
• GC can be expressed as a fixed-point computation that assigns reference counts ρ(n) to nodes n ∈ Nodes:

∀ref ∈ Nodes:
ρ(ref) = |{fld ∈ Roots : *fld = ref}| + |{fld ∈ Pointers(n) : n ∈ Nodes ∧ ρ(n) > 0 ∧ *fld = ref}|
• Nodes with non-zero count are retained and the rest should be reclaimed.
• Use of abstract data structures whose implementations can vary.
• W – a work list of objects to be processed. When empty, the algorithms terminate.
Abstract GC
atomic collectTracing():
    rootsTracing(W)      // find root objects
    scanTracing(W)       // mark reachable objects
    sweepTracing()       // free dead objects

rootsTracing(R):
    for each fld in Roots
        ref ← *fld
        if ref ≠ null
            R ← R + [ref]

scanTracing(W):
    while not isEmpty(W)
        src ← remove(W)
        ρ(src) ← ρ(src) + 1
        if ρ(src) = 1
            for each fld in Pointers(src)
                ref ← *fld
                if ref ≠ null
                    W ← W + [ref]
Abstract Tracing GC Algorithm
sweepTracing():
    for each node in Nodes
        if ρ(node) = 0
            free(node)
        else
            ρ(node) ← 0

New():
    ref ← allocate()
    if ref = null
        collectTracing()
        ref ← allocate()
        if ref = null
            error “Out of memory”
    ρ(ref) ← 0
    return ref
Abstract Tracing GC Algorithm (Continued)
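The abstract tracing collector above can be made concrete. Below is a minimal Python sketch: the heap is modeled as a dict from node name to its outgoing pointers, and the node names and example graph are illustrative assumptions, not taken from the slides.

```python
# Minimal model of the abstract tracing collector.
# The heap: a dict mapping each node to the nodes its fields point to.
def collect_tracing(nodes, roots, pointers):
    rho = {n: 0 for n in nodes}        # all counts are 0 between collections
    work = list(roots)                 # rootsTracing: W starts with the roots
    while work:                        # scanTracing
        src = work.pop()
        rho[src] += 1
        if rho[src] == 1:              # first visit: scan its fields
            work.extend(pointers.get(src, []))
    # sweepTracing: nodes whose count stayed 0 are unreachable
    live = {n for n in nodes if rho[n] > 0}
    return live, rho

# Illustrative heap: the root refers to B; D is unreachable.
nodes = {"A", "B", "C", "D"}
pointers = {"B": ["A", "C"], "C": ["B"], "D": ["A"]}
live, rho = collect_tracing(nodes, {"B"}, pointers)
assert live == {"A", "B", "C"}
```

Note that ρ ends up matching the fixed-point definition: each live object's count equals the number of references to it from the roots and from other live objects.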
[Slide animation: the abstract tracing algorithm on a four-object heap A, B, C, D, with the roots referring to B and C and with D unreachable. rootsTracing puts B and C on W; scanTracing sets ρ(B)=1, ρ(C)=1, ρ(A)=1 and finally ρ(B)=2 as the objects are visited; sweepTracing frees D and resets every count to 0.]
atomic collectCounting(I, D):
    applyIncrements(I)   // apply the buffered increments to ρ
    scanCounting(D)      // apply the buffered decrements, recursively
    sweepCounting()      // free dead objects

applyIncrements(I):
    while not isEmpty(I)
        ref ← remove(I)
        ρ(ref) ← ρ(ref) + 1

scanCounting(W):
    while not isEmpty(W)
        src ← remove(W)
        ρ(src) ← ρ(src) - 1
        if ρ(src) = 0
            for each fld in Pointers(src)
                ref ← *fld
                if ref ≠ null
                    W ← W + [ref]
Abstract reference counting GC Algorithm
sweepCounting():
    for each node in Nodes
        if ρ(node) = 0
            free(node)

New():
    ref ← allocate()
    if ref = null
        collectCounting(I, D)
        ref ← allocate()
        if ref = null
            error “Out of memory”
    ρ(ref) ← 0
    return ref
Abstract reference counting GC Algorithm (Continued)
inc(ref):
    if ref ≠ null
        I ← I + [ref]

dec(ref):
    if ref ≠ null
        D ← D + [ref]

atomic Write(src, i, dst):
    inc(dst)
    dec(src[i])
    src[i] ← dst
Abstract reference counting GC Algorithm (Continued)
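A corresponding Python sketch of the buffered reference-counting collector. The increment buffer I and decrement buffer D are filled by hand here to stand in for the Write barrier; the heap and buffer contents are illustrative assumptions.

```python
# Minimal model of the abstract (buffered) reference-counting collector.
# I buffers increments and D buffers decrements recorded by the Write barrier.
def collect_counting(rho, I, D, pointers):
    while I:                           # applyIncrements
        rho[I.pop()] += 1
    while D:                           # scanCounting: decrement, recursing
        ref = D.pop()                  # into children of objects that die
        rho[ref] -= 1
        if rho[ref] == 0:
            D.extend(pointers.get(ref, []))
    return {n for n, c in rho.items() if c == 0}   # sweepCounting frees these

# Illustrative heap: A -> B -> C, plus a root reference to B.
rho = {"A": 0, "B": 0, "C": 0}
pointers = {"A": ["B"], "B": ["C"]}
I = ["A", "B", "B", "C"]    # buffered increments (one per referrer)
D = ["A"]                   # the reference to A was overwritten
dead = collect_counting(rho, I, D, pointers)
assert dead == {"A"}        # A dies; B survives via its remaining reference
```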
[Slide animation: the abstract reference-counting algorithm on the same heap. The increment buffer I holds the targets of recorded pointer writes and the decrement buffer D holds the overwritten targets. applyIncrements raises the counts to ρ(A)=2, ρ(B)=3, ρ(C)=1, ρ(D)=1; scanCounting applies the buffered decrements, recursively decrementing the children of objects whose count drops to 0 (here D); sweepCounting then frees the objects whose count reached 0.]
atomic collectDrc(I, D):
    rootsTracing(I)      // add root objects to I
    applyIncrements(I)   // apply the buffered increments to ρ
    scanCounting(D)      // apply the buffered decrements, recursively
    sweepCounting()      // free dead objects
    rootsTracing(D)      // keep the invariant:
    applyDecrements(D)   // remove the root contributions again

New():
    ref ← allocate()
    if ref = null
        collectDrc(I, D)
        ref ← allocate()
        if ref = null
            error “Out of memory”
    ρ(ref) ← 0
    return ref
Abstract deferred reference counting GC Algorithm
atomic Write(src, i, dst):
    if src ≠ Roots
        inc(dst)
        dec(src[i])
    src[i] ← dst

applyDecrements(D):
    while not isEmpty(D)
        ref ← remove(D)
        ρ(ref) ← ρ(ref) - 1
Abstract deferred reference counting GC Algorithm (Continued)
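The deferred scheme can be sketched the same way. In this Python model, heap writes buffer inc/dec as before, root writes are not counted at all, and the roots' contribution is added and then removed around each collection; the scenario at the bottom is an illustrative assumption.

```python
# Minimal model of abstract deferred reference counting.
def collect_drc(rho, I, D, roots, pointers):
    I.extend(roots)                    # rootsTracing(I)
    while I:                           # applyIncrements
        rho[I.pop()] += 1
    while D:                           # scanCounting
        ref = D.pop()
        rho[ref] -= 1
        if rho[ref] == 0:
            D.extend(pointers.get(ref, []))
    dead = {n for n, c in rho.items() if c == 0}   # sweepCounting
    D.extend(roots)                    # rootsTracing(D): keep the invariant
    while D:                           # applyDecrements (not recursive)
        rho[D.pop()] -= 1
    return dead

# Illustrative scenario: B is a root; a heap reference to C was created
# (buffered in I) and later overwritten (buffered in D).
rho = {"B": 0, "C": 0}
dead = collect_drc(rho, I=["C"], D=["C"], roots=["B"], pointers={})
assert dead == {"C"}
assert rho == {"B": 0, "C": 0}   # root contributions are removed again
```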
[Slide animation: the abstract deferred reference-counting algorithm on the same heap. rootsTracing(I) temporarily buffers increments for the root-referenced objects; applyIncrements raises the counts; scanCounting applies the buffered decrements; sweepCounting frees D. Finally rootsTracing(D) and applyDecrements remove the temporary root contributions, restoring the invariant that ρ does not count references from the roots.]
Comparing GCs Summary
• GC performance depends on many aspects.
- Therefore, no collector has an absolute advantage over the others.
• Garbage collection can be expressed in an abstract way.
- This highlights the similarities and differences between collectors.
Allocation
• Three aspects to memory management:
- Allocation of memory in the first place
- Identification of live data
- Reclamation for future use
• Allocation and reclamation of memory are tightly linked.
• Several key differences between automatic and explicit memory management, in terms of allocating and freeing:
- GC frees space all at once
- A system with GC has more information when allocating
- With GC, users tend to write programs in a different style
• Uses a large free chunk of memory
• Given a request for n bytes, it allocates that much from one end of the free chunk.
sequentialAllocate(n):
    result ← free
    newFree ← result + n
    if newFree > limit
        return null
    free ← newFree
    return result
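A minimal executable sketch of sequentialAllocate, with addresses modeled as plain integers:

```python
# Bump-pointer (sequential) allocation over a region [free, limit).
class SequentialAllocator:
    def __init__(self, start, size):
        self.free = start
        self.limit = start + size

    def allocate(self, n):
        result = self.free
        new_free = result + n
        if new_free > self.limit:
            return None          # exhausted: only collection can recover space
        self.free = new_free
        return result

a = SequentialAllocator(0, 100)
assert a.allocate(60) == 0       # first object at the start of the chunk
assert a.allocate(40) == 60      # next object immediately after it
assert a.allocate(1) is None     # the chunk is now full
```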
Sequential Allocation
[Figure: sequential allocation. The region between free and limit is available; a request for n bytes advances free past the new object (plus any alignment padding), and result points at the allocated space.]
• Properties:
– Simple
– Efficient
– Better cache locality
– May be less suitable for non-moving collectors
Sequential Allocation
• A data structure records the location and size of free cells of memory.
• The allocator considers each free cell in turn, and according to some policy, chooses one to allocate.
• Three basic types of free-list allocation:
– First-fit
– Next-fit
– Best-fit
Free-list Allocation
First-fit Allocation
• Use the first cell that can satisfy the allocation request.
• A split of the cell may occur unless the remainder is too small.
firstFitAllocate(n):
    prev ← addressOf(head)
    loop
        curr ← next(prev)
        if curr = null
            return null
        else if size(curr) < n
            prev ← curr
        else
            return listAllocate(prev, curr, n)
listAllocate(prev, curr, n):
    result ← curr
    if shouldSplit(size(curr), n)
        remainder ← result + n
        next(remainder) ← next(curr)
        size(remainder) ← size(curr) - n
        next(prev) ← remainder
    else
        next(prev) ← next(curr)
    return result

listAllocateAlt(prev, curr, n):
    if shouldSplit(size(curr), n)
        size(curr) ← size(curr) - n
        result ← curr + size(curr)
    else
        next(prev) ← next(curr)
        result ← curr
    return result
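The first-fit policy with splitting can be sketched in Python. Here the free list is modeled as a Python list of (address, size) pairs rather than the threaded next/size fields of the pseudocode, and MIN_CELL is an assumed minimum useful remainder size.

```python
# First-fit over a free list of (address, size) cells. A cell is split
# when the remainder is at least MIN_CELL (an assumed threshold).
MIN_CELL = 16

def first_fit_allocate(free_list, n):
    for i, (addr, size) in enumerate(free_list):
        if size < n:
            continue                             # too small: keep scanning
        if size - n >= MIN_CELL:                 # shouldSplit
            free_list[i] = (addr + n, size - n)  # keep the remainder
        else:
            del free_list[i]                     # hand out the whole cell
        return addr
    return None                                  # no cell can satisfy n

cells = [(0, 150), (300, 100), (500, 170)]
assert first_fit_allocate(cells, 120) == 0       # the 150-unit cell is split
assert cells[0] == (120, 30)                     # a 30-unit remainder stays
```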
First-fit Allocation
First-fit example (the cells of the free list, in list order):
Initial: 150KB 100KB 170KB 300KB 50KB
After a 120KB request: 30KB 100KB 170KB 300KB 50KB
After a 50KB request: 30KB 50KB 170KB 300KB 50KB
After a 200KB request: 30KB 50KB 170KB 100KB 50KB
• Small remainder cells accumulate near the front of the list, slowing down allocation.
• In terms of space utilization, may behave similarly to best-fit.
• An issue is where in the list to enter a newly freed cell
• It is usually more natural to build the list in address order, like mark-sweep does.
First-fit Allocation
• A variation of first-fit
• Method - start the search for a cell of suitable size from the point in the list where the last search succeeded.
• When reaching the end of list, start over from the beginning.
• Idea - reduce the need to iterate repeatedly past the small cells at the head of the list.
• Drawbacks: – Fragmentation
– Poor locality on accessing the list
– Poor locality of the allocated objects
Next-fit Allocation
nextFitAllocate(n):
    start ← prev
    loop
        curr ← next(prev)
        if curr = null
            prev ← addressOf(head)   // wrap around to the list head
            curr ← next(prev)
        if prev = start
            return null
        else if size(curr) < n
            prev ← curr
        else
            return listAllocate(prev, curr, n)
Next-fit Allocation Algorithm
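The next-fit policy can be sketched in Python with the roving pointer modeled as a list index returned to the caller; the cells are (address, size) pairs and the remainder-size check is omitted for brevity.

```python
# Next-fit: start scanning where the previous search succeeded, wrapping
# around at the end of the list.
def next_fit_allocate(free_list, n, start):
    for k in range(len(free_list)):
        i = (start + k) % len(free_list)
        addr, size = free_list[i]
        if size >= n:
            free_list[i] = (addr + n, size - n)  # split the cell
            return addr, i                       # i resumes the next search
    return None, start                           # full wrap: no fit

cells = [(0, 30), (100, 100), (300, 170)]
addr, pos = next_fit_allocate(cells, 50, start=1)   # skip the small head cell
assert addr == 100
addr, pos = next_fit_allocate(cells, 60, start=pos) # resume at position 1
assert addr == 300
```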
Next-fit example (the cells of the free list, in list order):
Initial: 150KB 100KB 170KB 300KB 50KB
After a 120KB request: 30KB 100KB 170KB 300KB 50KB
After a 20KB request: 30KB 80KB 170KB 300KB 50KB
After a 50KB request: 30KB 80KB 120KB 300KB 50KB
• Method - find the cell whose size most closely matches the allocation request.
• Idea:
– Minimize waste
– Avoid splitting large cells unnecessarily
• Bad worst case
Best-fit Allocation
bestFitAllocate(n):
    best ← null
    bestSize ← ∞
    prev ← addressOf(head)
    loop
        curr ← next(prev)
        if curr = null || size(curr) = n     // exact fit, or end of list
            if curr ≠ null
                bestPrev ← prev
                best ← curr
            else if best = null
                return null
            return listAllocate(bestPrev, best, n)
        else if size(curr) < n || bestSize < size(curr)
            prev ← curr                      // too small, or no improvement
        else
            best ← curr
            bestPrev ← prev
            bestSize ← size(curr)
Best-fit Allocation Algorithm
Best-fit example (the cells of the free list, in list order):
Initial: 150KB 100KB 170KB 300KB 50KB
After a 90KB request: 150KB 10KB 170KB 300KB 50KB
After a 50KB request: 150KB 10KB 170KB 300KB (the exact-fit 50KB cell is used)
After a 100KB request: 50KB 10KB 170KB 300KB
• Use a balanced binary tree.
• Sorted by size (for best-fit) or by address (for first-fit or next-fit).
• If sorted by size, only one cell of each size need be entered.
• Example: Cartesian tree for first/next-fit.
– Indexed by address (primary key) and size (secondary key)
– Total order by address
– Organized as a heap for the sizes
Speeding Free-list Allocation
• Searching in the Cartesian tree under first-fits policy:
firstFitAllocateCartesian(n):
    parent ← null
    curr ← root
    loop
        if left(curr) ≠ null && max(left(curr)) ≥ n
            parent ← curr
            curr ← left(curr)
        else if prev < curr && size(curr) ≥ n
            prev ← curr
            return treeAllocate(curr, parent, n)
        else if right(curr) ≠ null && max(right(curr)) ≥ n
            parent ← curr
            curr ← right(curr)
        else
            return null
Speeding Free-list Allocation
• Dispersal of free memory across a possibly large number of small free cells.
• Negative effects:
- Can prevent allocation from succeeding
- May cause a program to use more address space, more resident pages and more cache lines
• Fragmentation is impractical to avoid:
- Usually the allocator cannot know what the future request sequence will be
- Even given a known request sequence, computing an optimal allocation is NP-hard
• Usually there is a trade-off between allocation speed and fragmentation.
Fragmentation
• Idea – use multiple free-lists whose members are segregated by size, in order to speed allocation.
• Usually a fixed number k of size values s0 < s1 < … < sk−1.
• k+1 free-lists f0, …, fk.
• For a free cell b on list fi: size(b) = si for 0 ≤ i ≤ k−1, and size(b) > sk−1 for i = k.
• When requesting a cell of size b ≤ sk−1, the allocator rounds the request size up to the smallest si such that b ≤ si.
• si is called a size class.
Segregated-fits Allocation
segregatedFitAllocate(j):
    result ← remove(freeLists[j])
    if result = null
        large ← allocateBlock()
        if large = null
            return null
        initialize(large, sizes[j])
        result ← remove(freeLists[j])
    return result
• List fk, for cells larger than sk−1, is organized using one of the basic single-list algorithms.
• Per-cell overheads for large cells are a bit higher, but in total this is negligible.
• The main advantage: for every size class other than the last, allocation typically requires only constant time.
Segregated-fits Allocation
[Figure: free-lists f0 … fk−1, each holding cells of a single size s0 … sk−1, plus list fk holding cells larger than sk−1.]
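A Python sketch of segregated-fits allocation; SIZE_CLASSES and BLOCK are illustrative choices, and fresh blocks are carved from a simple bump pointer.

```python
import bisect

# Segregated fits: round each request up to its size class and pop a cell
# from that class's free list; when a list is empty, refill it by slicing
# a fresh block into cells of that size.
SIZE_CLASSES = [16, 32, 64, 128]   # s0 < s1 < ... < s(k-1)
BLOCK = 1024

free_lists = {s: [] for s in SIZE_CLASSES}
next_block = 0                     # bump pointer handing out fresh blocks

def segregated_fit_allocate(n):
    global next_block
    j = bisect.bisect_left(SIZE_CLASSES, n)
    if j == len(SIZE_CLASSES):
        return None                # larger than s(k-1): use the general list
    s = SIZE_CLASSES[j]
    if not free_lists[s]:          # initialize(large, sizes[j])
        base, next_block = next_block, next_block + BLOCK
        free_lists[s] = [base + off for off in range(0, BLOCK, s)]
    return free_lists[s].pop()     # constant-time in the common case

a = segregated_fit_allocate(20)    # rounded up to the 32-byte class
b = segregated_fit_allocate(30)    # same class, distinct cell
assert a != b and a % 32 == 0 and b % 32 == 0
```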
• With simple free-list allocators – free cells that are too small to satisfy a request. Called external fragmentation.
• With segregated-fits allocation – wasted space inside an individual cell because the requested size was rounded up. Called internal fragmentation.
More on Fragmentation
• Important consideration – how to populate each free-list of segregated-fits.
• Two approaches:
– Dedicating whole blocks to particular sizes
– Splitting
Populating size classes
• Choose some block size B, a power of two.
• The allocator is provided with blocks.
• If the request is larger than one block, multiple contiguous blocks are allocated.
• For a size class s < B, we populate the free-list fs by allocating a block and immediately slice it into cells of size s.
• Metadata of the cells is stored on the block.
Big Bag of Pages Block-based allocation
• Disadvantage:
– Fragmentation, average waste of half a block (worst case (B-s)/B).
• Advantages:
– Reduced per-cell metadata
– Simple and efficient for the common case
Big Bag of Pages Block-based allocation
• Like simple free-list schemes, split a cell if that is the only way to satisfy a request.
• Improvement: concatenate the remaining portion to a suitable free-list (if possible).
• For example – the buddy system:
– Size classes are powers of two
– Can split a cell of size 2^(i+1) into two cells of size 2^i
– Can combine in the opposite direction (only if the two small cells were split from the same large cell)
Splitting
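A Python sketch of the buddy system described above. The address-XOR trick for locating a cell's buddy relies on cells being aligned to their own size within the area; the sizes below are illustrative.

```python
# Buddy-system sketch: sizes are powers of two; splitting a 2^(i+1) cell
# yields two buddies of size 2^i, and a freed cell recombines with its
# buddy when that buddy is also free. Buddies differ in exactly one
# address bit, so buddy = addr XOR size.
MIN, MAX = 16 * 1024, 128 * 1024
free_lists = {16 * 1024: [], 32 * 1024: [], 64 * 1024: [], 128 * 1024: [0]}

def buddy_allocate(n):
    size = MIN
    while size < n:                        # round up to a power-of-two class
        size *= 2
    s = size
    while s <= MAX and not free_lists[s]:  # find a cell large enough to split
        s *= 2
    if s > MAX:
        return None
    addr = free_lists[s].pop()
    while s > size:                        # split down to the requested class
        s //= 2
        free_lists[s].append(addr + s)     # the upper buddy stays free
    return addr

def buddy_free(addr, size):
    buddy = addr ^ size
    if size < MAX and buddy in free_lists[size]:
        free_lists[size].remove(buddy)     # coalesce with the free buddy
        buddy_free(min(addr, buddy), size * 2)
    else:
        free_lists[size].append(addr)

a = buddy_allocate(20 * 1024)              # rounded up to 32KB
assert a == 0
buddy_free(a, 32 * 1024)                   # all the splits coalesce back
assert free_lists[MAX] == [0]
```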
[Slide animation: the buddy system on a 128KB area with 16KB minimum and 128KB maximum cell size. A 20KB request splits 128KB into two 64KB buddies and one of those into two 32KB buddies; a later 10KB request splits a 32KB cell into two 16KB buddies. As cells are freed, buddies recombine pairwise until the single 128KB cell is restored.]
• Alignment
• Size constraints
• Boundary tags
• Heap parsability
• Locality
Allocation’s Additional Considerations
• Allocated objects may require special alignment
• For example: a double-word floating-point value
– Making the granule a double-word would be wasteful
– In Java, an array header takes 3 words – one word is wasted or skipped
Alignment
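Alignment is usually handled by rounding the allocation pointer up to the next boundary; for a power-of-two alignment this is a standard bit trick:

```python
# Rounding an address up to an alignment boundary (alignment must be a
# power of two): add alignment-1, then clear the low bits.
def align_up(addr, alignment):
    return (addr + alignment - 1) & ~(alignment - 1)

assert align_up(12, 8) == 16   # pad 4 bytes up to a double-word boundary
assert align_up(16, 8) == 16   # already aligned: no padding
```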
• Some collection schemes require a minimum amount of space in each cell.
– Forwarding address
– Lock/status
• In that case, the allocator will allocate more words than requested.
Size Constraints
• Additional header or boundary tag associated with each cell.
• Found outside the storage available to the program.
• Indicates size and allocated/free status
• Is one or two words long
• Possible use of bitmap instead
Boundary Tags
• The ability to advance from cell to cell through the heap.
• An object’s header (one or two words):
– Type
– Hash code
– Synchronization information
– Mark bit
• The header comes before the data
• The reference refers to the first element/field
Heap Parsability
• How to handle the gaps left by alignment?
– Zero all free space in advance
– Devise a distinct range of values to write at the start of each gap
• Parsing is easier with a bitmap indicating where each object starts.
– This requires additional space and time
Heap Parsability
• During allocation
– Address-ordered free-lists and sequential allocation exhibit good locality.
• During freeing
– Goal: objects that are freed together should be near each other.
– Empirically, objects allocated at the same time often become unreachable at about the same time.
Locality
• Multiple threads allocating concurrently.
• Most steps in allocation need to be atomic.
• This can become a bottleneck.
• Basic solution – give each thread its own allocation area.
• A global pool hands out chunks to threads as needed.
Allocation in Concurrent Systems
Allocation Summary
• Methods:
- Sequential
- Free-list: first-fit, next-fit and best-fit
- Segregated-fits
• Various considerations to keep in mind.