
© Imperial College London

Exploring the Barrier to Entry

Incremental Generational Garbage Collection for Haskell

Andy Cheadle & Tony Field, Imperial College London

Simon Marlow & Simon Peyton Jones, Microsoft Research, Cambridge, UK

Lyndon While, The University of Western Australia, Perth


Introduction

We focus on Haskell with the intent of building an:

• Efficient
• Barrierless
• Hybrid
• Incremental
• Generational

…garbage collector for GHC

Investigate pause time bounds and mutator utilisation.

Explore application to other dynamic dispatch systems.


Highlights

• Improving Non-Stop Haskell
  – Incremental GC read-barrier optimisation without the per-object space overhead

• Bridging the Generation Gap
  – Generational GC write-barrier optimisation

• Consistent Mutator Utilisation
  – Time-based versus work-based scheduling


Barriers: Friend or Foe - Summary

• Blackburn & Hosking - ISMM 2004

• Conditional read-barrier
  – AMD: 21.24%, P4: 15.91%, PPC: 6.49%
  – Incremental GC: standard Baker read-barrier

• Unconditional read-barrier
  – AMD: 8.05%, P4: 5.04%, PPC: 0.85%
  – Brooks indirection read-barrier
  – Metronome ‘Eager’ barrier ~ 4%
  – BUT: space overhead -> increased GC count

• Must consider GC cost!!!
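The trade-off can be made concrete with a minimal C sketch (all names are ours, not from the measured systems) contrasting the two barrier styles:

```c
#include <assert.h>

/* Hypothetical sketch of the two read-barrier styles compared above.
 * An object carries a Brooks forwarding word that points either to
 * itself or to its to-space copy. */

typedef struct obj {
    struct obj *forward;  /* self, or the to-space copy */
    int in_from_space;    /* stands in for a real space test */
    int data;
} obj;

/* Conditional (Baker-style) barrier: a test on every pointer load. */
static obj *read_conditional(obj *p) {
    if (p->in_from_space)
        p = p->forward;   /* follow the forwarded copy on demand */
    return p;
}

/* Unconditional (Brooks) barrier: no test, always one indirection. */
static obj *read_brooks(obj *p) {
    return p->forward;
}
```

The conditional barrier pays a compare-and-branch on every load; the Brooks barrier pays a dependent load plus a forwarding word per object — the space overhead that, as the slide notes, drives up the GC count.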


Non-Stop Haskell

• Implementing Baker’s incremental collector typically introduces high overheads

– The software read-barrier

• We have shown that this can be done efficiently in systems with dynamic dispatching

Caveat: Dynamic dispatching already “costs” something; we show that incremental garbage collection comes at virtually no extra cost.


Dynamic Dispatch and the STG Machine

• The STG machine is a model for the compilation of lazy functional languages

• All objects are represented on the heap as closures:

• To compute function ‘f’ applied to arguments ‘a b c d’, jump to its entry code

[Diagram: the closure for ‘f a b c d’ holds an info pointer to a static info table (layout “2, 2”, entry code, other fields) followed by payload fields 0–4: heap pointers first, then immediates]
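As a rough illustration, this dispatch scheme can be modelled in C (a simplification with hypothetical names; real GHC info tables live in the RTS and carry more fields):

```c
#include <assert.h>

/* Hypothetical sketch of an STG-style closure.  Every heap object
 * starts with an info pointer; "entering" the object jumps to the
 * entry code its info table names. */

struct closure;
typedef int (*entry_code)(struct closure *self);

typedef struct info_table {
    int ptrs;          /* number of heap-pointer payload fields */
    int nonptrs;       /* number of immediate payload fields    */
    entry_code entry;  /* code run when the closure is entered  */
} info_table;

typedef struct closure {
    const info_table *info;  /* first word: the static info table */
    long payload[4];         /* pointer fields, then immediates   */
} closure;

/* Entering a closure is one indirect jump through the info pointer. */
static int enter(closure *c) {
    return c->info->entry(c);
}

/* Example entry code: return the first payload field. */
static int example_entry(struct closure *self) {
    return (int) self->payload[0];
}

/* A static "2, 2" info table: two pointers, two immediates. */
static const info_table example_info = { 2, 2, example_entry };
```

Every use of a closure goes through `c->info->entry`, so retargeting the info pointer silently changes what happens on the next entry — the hook the following slides exploit.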


The Read-Barrier Invariant

[Diagram: the stack holds return addresses 1r, 2r, 3r; to-space contains closure 1 (scavenged) and closures 2 and 3 (unscavenged), with the remainder still in from-space. Problem 1: the mutator must not enter an unscavenged closure. Problem 2: the mutator must not return into an unscavenged stack frame.]


Invariant Problem 1: Scavenging Closures

• When the garbage collector is on, make info pointers point to code that scavenges evacuated closures before entering them

• At all other times the system operates with no read barrier!

[Diagram: the evacuated closure’s info pointer now targets self-scavenging code; the info table layout (“2, 2”, other fields) is unchanged]


Q How do we restore the original info pointer?

A We remember it when the closure is evacuated

Non-Stop Haskell:

• Use an extra word in to-space

• Note: the space overhead applies only to objects copied from from-space but effectively reduces to-space by 30%

• Freshly allocated objects carry no space overhead

[Diagram: the evacuated closure gains an extra word at offset −1 holding the original info pointer, while its info pointer targets self-scavenging code; layout “2, 2” and other fields are unchanged]
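A minimal C sketch of this scheme, using function pointers in place of STG info pointers (names are ours):

```c
#include <assert.h>

/* Hypothetical sketch of the Non-Stop Haskell flip.  Evacuation
 * remembers the original info pointer in an extra to-space word and
 * redirects the closure to self-scavenging code. */

struct closure;
typedef int (*entry_code)(struct closure *self);

typedef struct closure {
    entry_code entry;        /* info pointer, collapsed to entry code */
    entry_code saved_entry;  /* the extra word written at evacuation  */
    long payload[2];
    int scavenged;
} closure;

static int real_entry(struct closure *self) {
    return (int) self->payload[0];
}

/* Self-scavenging code: scavenge, restore the remembered pointer,
 * then behave exactly like the original entry code. */
static int self_scav_entry(struct closure *self) {
    self->scavenged = 1;              /* scavenge pointer fields here */
    self->entry = self->saved_entry;  /* later entries pay nothing    */
    return self->entry(self);
}

/* Evacuation while the collector is on. */
static void evacuate(closure *c) {
    c->saved_entry = c->entry;
    c->entry = self_scav_entry;
}
```

The first entry pays for scavenging and restores the remembered info pointer; every later entry runs at full speed with no barrier test.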


Q How do we restore the original info pointer?

A We remember it when the closure is evacuated

In production:

• Specialise every closure type at compile time

• Runtime space overhead is replaced by a static one of ~ 25%

[Diagram: each closure type gets a statically generated self-scavenging code block ending in “JMP Entry code”; the info pointer flips between the two static entry points, so no extra per-object word is needed]
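The specialised scheme can be sketched the same way; because the self-scavenging variant is generated per closure type, it statically knows its normal twin and no saved word is needed (hypothetical names again):

```c
#include <assert.h>

/* Hypothetical sketch of the specialised scheme: each closure type is
 * compiled with a second, self-scavenging entry point that statically
 * knows its normal twin, so no per-object saved word is needed. */

struct closure;
typedef int (*entry_code)(struct closure *self);

typedef struct closure {
    entry_code entry;   /* info pointer, collapsed to entry code */
    long payload[2];
    int scavenged;
} closure;

/* Normal entry code for this closure type. */
static int pair_entry(struct closure *self) {
    return (int) self->payload[0];
}

/* Compile-time generated variant: scavenge, restore the statically
 * known entry point, then jump to it (the "JMP Entry code"). */
static int pair_entry_selfscav(struct closure *self) {
    self->scavenged = 1;        /* scavenge pointer fields here */
    self->entry = pair_entry;   /* restore without a saved word */
    return pair_entry(self);
}

/* Evacuation while the collector is on: flip to the static variant. */
static void evacuate(closure *c) {
    c->entry = pair_entry_selfscav;
}
```

The per-object word of the previous scheme becomes one extra static code block per closure type — roughly the ~25% static overhead quoted above.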


Invariant Problem 2: Stack Scavenging

• STG machine stack frames look just like closures

• Before returning to the caller frame we ‘hijack’ the caller’s return address, replacing it with a pointer to self-scavenging code for that frame

[Diagram: frame 1 is scavenged; return addresses 2r and 3r lead to unscavenged frames 2 and 3. Entering the hijacked return address runs “scav; mod 3r; update; return”: frame 2 is scavenged, 3r is hijacked in turn, and the original update-and-return code then runs as normal]
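A toy C model of the hijacking protocol, with the stack as an array of frames and return addresses as function pointers (all names are ours):

```c
#include <assert.h>

/* Toy model of return-address hijacking.  The stack is an array of
 * frames; a frame's "return address" is a function pointer invoked
 * when the mutator returns through that frame. */

#define STACK_DEPTH 3

typedef struct frame {
    void (*ret)(int i);        /* current return code for frame i */
    void (*saved_ret)(int i);  /* hijacked original, remembered   */
    int scavenged;
} frame;

static frame stack[STACK_DEPTH];

static void hijack(int i);

/* The frame's real continuation: nothing to do in this sketch. */
static void normal_return(int i) {
    (void) i;
}

/* Self-scavenging return code: scavenge frame i, hijack the next
 * frame down, restore the original return address and run it. */
static void scav_return(int i) {
    stack[i].scavenged = 1;            /* scavenge the frame here */
    if (i + 1 < STACK_DEPTH)
        hijack(i + 1);                 /* propagate the hijack    */
    stack[i].ret = stack[i].saved_ret; /* restore and return      */
    stack[i].ret(i);
}

/* Hijack: remember the frame's return address, substitute the stub. */
static void hijack(int i) {
    stack[i].saved_ret = stack[i].ret;
    stack[i].ret = scav_return;
}
```

Only one frame's return address is hijacked at a time; scavenging a frame propagates the hijack one frame down, so the stack is scavenged lazily as the mutator returns through it.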


Background Scavenging

• GHC’s heap is block-allocated, so scavenge at:
  – Every Allocation (EA)
  – Every Block allocation (EB)

• Reduce forced-completions via block chaining

• Incremental scavenger pauses are allocation-dependent

• Exploit GHC’s lightweight scheduler to implement a time-scheduled scavenger (cf. Jikes RVM Metronome)
  – Consistent mutator utilisation
  – Increase in forced-completions due to allocation bursts
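As a toy illustration of the work-based (EB) policy, assuming a fixed scavenge quantum per block allocated (constants and names are ours, not GHC's):

```c
#include <assert.h>

/* Toy sketch of work-based incremental scavenging: a fixed quantum of
 * scavenge work is done at every fresh block allocation (EB policy). */

#define WORDS_PER_BLOCK 1024
#define SCAV_QUANTUM    2      /* closures scavenged per new block */

static int to_scavenge = 10;   /* unscavenged closures in to-space */
static int allocated_words = 0;

static void scavenge_some(int n) {
    while (n-- > 0 && to_scavenge > 0)
        to_scavenge--;         /* scavenge one closure */
}

/* Mutator allocation entry point: scavenging is coupled to the
 * allocation rate, so pauses are allocation-dependent. */
static void allocate(int words) {
    int blocks_before = allocated_words / WORDS_PER_BLOCK;
    allocated_words += words;
    int blocks_after = allocated_words / WORDS_PER_BLOCK;
    scavenge_some((blocks_after - blocks_before) * SCAV_QUANTUM);
}
```

Time-based scheduling replaces the allocation trigger with a timer tick, which evens out mutator utilisation but risks forced completions when allocation arrives in bursts.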


Results – Binary Sizes

Code-size overheads relative to stop-copy (Min/Max/All taken over all 36 applications):

Application   Stop-copy (KB)  Baker (EA)  Baker (EB)  NSH (EA)  NSH (EB)  SPS (EA)  SPS (EB)  Metronome 10 ms
circsim            287          12.99%      8.67%      14.15%     9.73%    32.20%    26.72%       34.71%
lcss               239          12.76%      8.75%      15.83%    11.67%    30.83%    26.65%       34.02%
symalg             175          12.82%      9.76%      19.25%    16.01%    29.89%    26.63%       32.14%
wave4main          424          12.84%      7.93%      11.49%     6.51%    28.03%    22.95%       32.11%
x2n1               325          12.71%      8.34%      13.04%     8.55%    28.46%    23.95%       32.04%
Min 36              –           12.12%      7.93%       9.00%     2.98%    26.98%    22.04%       29.05%
Max 36              –           14.40%      9.67%      20.22%    16.90%    31.90%    27.83%       34.74%
All 36              –           12.98%      8.81%      14.87%    10.54%    29.59%    25.23%       32.95%


Results – Runtimes

Runtime overheads relative to stop-copy (Min/Max/All taken over all 36 applications):

Application   Stop-copy (s)  Baker (EA)  Baker (EB)  NSH (EA)  NSH (EB)  SPS (EA)  SPS (EB)  Metronome 10 ms
circsim           39.64        29.33%     23.30%     20.29%    13.94%     8.70%     5.58%        9.73%
lcss              54.24        56.80%     34.05%     35.47%    17.74%    23.66%     8.03%       18.75%
symalg           181.41        16.56%     17.25%      7.06%     5.57%     2.88%     3.76%      -13.52%
wave4main         26.28         7.18%      0.51%      8.74%     0.68%     7.77%     0.39%        2.20%
x2n1             218.39         6.52%      4.10%      9.45%     8.28%     9.77%     9.13%        6.09%
Min 36              –          -0.11%     -0.08%     -4.04%    -3.40%    -3.10%    -2.54%      -13.52%
Max 36              –          79.02%     73.56%     35.47%    37.26%    24.20%    18.65%       70.89%
All 36              –          27.34%     23.61%     11.22%     8.26%     7.94%     4.88%       13.63%


The Generational Write-barrier

[Diagram: generation N and generation N − 1, each with its own root set; an inter-generational pointer from generation N into generation N − 1 must be recorded in generation N − 1’s root set]

Depending on the number of updates, the write-barrier can impose an overhead of 8 – 24% (NJ/ML and Clean).
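For reference, a minimal C sketch of the write barrier in question: every pointer store is checked, and only old-to-young stores are remembered (names are ours):

```c
#include <assert.h>

/* Minimal sketch of a generational write barrier with a remembered
 * set.  Generation numbers: higher = older. */

#define REM_SET_MAX 16

typedef struct obj {
    int gen;             /* generation this object lives in */
    struct obj *field;   /* a single mutable pointer field  */
} obj;

static obj *remembered_set[REM_SET_MAX];
static int remembered_count = 0;

/* Write barrier: perform the store, then remember the source object
 * only if it now holds an old-to-young pointer. */
static void write_field(obj *src, obj *val) {
    src->field = val;
    if (val != 0 && src->gen > val->gen && remembered_count < REM_SET_MAX)
        remembered_set[remembered_count++] = src;
}
```

It is this per-store check, executed on every update, that the next slides avoid by letting dynamic dispatch select the update code instead.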


Bridging the Generation Gap

We implement in GHC a mechanism that again exploits dynamic dispatch to eliminate unnecessary write-barriers:

[Diagram: generation 0 with its root set; a THUNK_SELECT pointing at THUNK_1 and THUNK_2 is about to be promoted to generation 1]


Bridging the Generation Gap

[Diagram: the THUNK_SELECT now sits in generation 1; THUNK_1 and THUNK_2 remain in generation 0 behind IND_PRE_UPD indirections, and evaluation of the THUNK selectee is forced]


Bridging the Generation Gap

[Diagram: after the forced evaluation, one indirection has changed from IND_PRE_UPD to IND_UPD]


Bridging the Generation Gap

[Diagram: THUNK_2 has been updated to CONSTR_2 behind an IND_OLDGEN indirection, so the resulting inter-generational pointer is recorded without a mutator write-barrier check]

Preliminary benchmarks suggested a reduction of 5 – 9%; in production it is around 2 – 3%.


Ongoing Work

Application of read-barrier optimisation to Java:

• Unfortunately Java programs are not “pure” in their use of dynamic dispatch
  – Field access via get() / set() methods
  – Inlining must be disallowed

Investigating within Jikes RVM:

• Inter- and intra-class inlining

• Code bloat arising from get() / set() methods, restricted inlining and additional per-class VMT

• Cost of VMT TIB pointer flip


Conclusion

Removal of collector-specific barriers and tests:
• Yields cheaper ‘vanilla’ collectors
• Allows the efficient hybridisation of multiple collector algorithms

Time-based scheduling is massively attractive, but:
• Complete decoupling from the allocator is problematic*
• A hybrid approach looks promising:
  – Parameterised by mutator utilisation
  – Sensitive to allocation rate

Elimination of per-object overhead:
• Mandatory for our production collector