Transcript

Taking Off The GlovesWith Reference Counting Immix

Rifat ShahriyarXi Yang

Stephen M. BlackburnAustralian National University

Kathryn S. McKinleyMicrosoft Research

2

53 Years Ago…

3

The Birth of GC

4

Today…

5

Why Reference Counting?

Advantages✔ Reclaim as-you-go✔ Object-local✔ Basic RC is easy

Disadvantages✘ Cycles✘ Performance

Series1

40%

9%Tota

l Tim

e v

Pro

du

cti

on

Backup tracing

<2013 2013

OurGoal

6

Why So Slow?

Total Time

Mutator Time

GC Time

9% 9%

-3%

Tim

e v

Pro

du

cti

on

Total Mutator

GC

7

Looking a Little Deeper…

Mutator time Instructions retired L1D cache misses

9% 9%

32%

6%4%

28%

-2% -3% -2%-3% -3%

1%

RC MS SS Immix

Mu

tato

r v P

rod

ucti

on

Time InstructionsRetired

L1 DCache Misses

8

Free List vs. Bump Pointer

Bump Pointer

Free List

9

Looking a Little Deeper…

Mutator time Instructions retired L1D cache misses

9% 9%

32%

6%4%

28%

-2% -3% -2%-3% -3%

1%

RC MS SS Immix

Mu

tato

r v P

rod

ucti

on

Time InstructionsRetired

L1 DCache Misses

Free List

Bump Pointer

10

Reference Counting

11

Basic Reference Counting[Collins 1960]

A B C D E F1 1 1 1 3 12 2

E0 1

12

How RC worksFundamental optimizations

• Backup tracing [Weizenbaum 1969]

– Reclaim cyclic garbage

• Deferral [Deutsch and Bobrow 1976]

– Note changes to stacks & registers occasionally

• Coalescing [Levanoni and Petrank 2001]

– Note only initial and final state of references

13

Deferral[Deutsch and Bobrow 1976, Bacon et al. 2001]

Stacks & Registers

A++F++

B--D++ A--

F--

A B C D FE1 1 1 1 2 1

A--

21 0 2 2

mutator activityGC: scan rootsGC: apply incrementsGC: apply decrementsGC: collectGC: move deferred decsA--F--

++ -- --'

14

Coalescing[Levanoni and Patrank 2001]

B--

Remember A Ignore intermediate mutations

Compare A, Aold

B--, F++

C++C--

D++D--

E++E--

F++

A B C D FE

15

How RC worksRecent Optimizations

• Limited bit count [Shahriyar et al. 2012]

– Use just few bits, fix o/f with backup tracing

• Elision of new object counts [Shahriyar et al. 2012]

– Only do RC work if object survives to first GC

• Allocate as dead [Shahriyar et al. 2012]

– Avoid free-list work for short lived objects

16

How Immix works

0

• Contiguous allocation into regions– 256B lines and 32KB blocks– Objects span lines but not blocks

• Simple mark phase– Mark objects and containing regions

• Free unmarked regions • Recycled allocation and defragmentation

block

line

recyclable linesobject mark line mark

17

Goal,Challenges,

Contributions

18

Goal & Challenges

• Goal– Object-local pay-as-you-go collection– Excellent mutator locality– Copying to eliminate fragmentation

• Immix provides opportunistic copying ✔Same mutator locality as contiguous allocator

• However, RC is inherently localReferences to an object generally unknown……but copying must redirect all references

19

Contributions

✔Identify heap layout as bottleneck for RC✔Introduce copying RC (RC Immix)

✔Exploit Immix’s opportunistic copy✔Observe new objects can be copied by first GC✔Observe old objects can be copied by backup GC✔Line/block reclamation, header bits

✔Deliver great performance

20

Design of RC Immix

21

Reference Countingin RC Immix

• Reference count for object• Live object count for line

– Lines ‘born dead’ (zero live object count)– Inc when any object gets first RC increment– Dec when any object is dead

• Collect lines with zero live object count

0 01 3 1 2

11 3 2 1 2200

22

Cycle Collectionin RC Immix

0

• Live object counts zeroed• Trace marks live objects and lines

– Corrects incorrect counts (due to cycles)

• Sweep– Collects unmarked lines– Sweeps dead lines, not dead objects

13 2122 4 0 00 0 2

23

DefragmentationIn RC Immix

• RC is object-local, inhibiting copying• But, RC Immix seizes two opportunities

1. All references to new objects known at first GC2. Backup tracing performs a global trace

• Use opportunistic copying in both cases– Mix copying with in-place RC and marking – Stop copying when available space exhausted

24

Proactive Defragmentation

• Copy surviving new objects (with bounded reserve)

• Optimization, not for correctness– Reserve sized for performance unlike semi-space

• Use past survival rate to predict the future

1 2120 30 41 521 3

25

Reactive Defragmentation

• Backup tracing performs a global trace• Piggyback on this, copy live objects• Use available memory threshold

– If below threshold, do defrag at next cycle GC

26

Methodology

27

Hardware, Software & Benchmarks

• 21 benchmarks– DaCapo, SPECjvm98 and pjbb2005

• 20 invocations for each benchmark• Jikes RVM and MMTk

– All garbage collectors are parallel

• Intel Core i7 2600K, 4GB• Ubuntu 10.04.1 LTS

28

Results

29

Bottom LineGeomean of all benchmarks, versus production

heap size = 2x the minimum heap size

3% improvement over production on geomean

-35%

-30%

-25%

-20%

-15%

-10%

-5%

0%

5%

10%

15%

RC

RC Immix

TotalTime

MutatorTime

GCTime

30

Total TimeBy Benchmark

heap size = 2x the minimum heap size

+5% worst case, -25% best case

-30%

-20%

-10%

0%

10%

20%

30%

40%

RC RC Immix

fast

er

Tim

e

slo

we

r

jess db

javac

mtr

t

jack

avro

ra

blo

at

chart

ecl

ipse fop

hsq

ldb

jyth

on

luin

dex

luse

arc

hfix

pm

d

sunflow

xala

n

pjb

b2

00

5

com

pre

ss

31

Mutator TimeBy Benchmark

heap size = 2x the minimum heap size

+4% worst case, -10% best case

-15%

-10%

-5%

0%

5%

10%

15%

20%

25%

30%

RC RC Immix

fast

er ←

M

uta

tor

→ s

low

er

jess db

javac

mtr

t

jack

avro

ra

blo

at

chart

ecl

ipse fop

hsq

ldb

jyth

on

luin

dex

luse

arc

hfix

pm

d

sunflow

xala

n

pjb

b2

00

5

com

pre

ss

32

GC TimeBy Benchmark

heap size = 2x the minimum heap size

+5% worst case, -25% best case

-30%

-20%

-10%

0%

10%

20%

30%

40%

RC RC Immix

fast

er

GC

sl

ow

er

jess db

javac

mtr

t

jack

avro

ra

blo

at

chart

ecl

ipse fop

hsq

ldb

jyth

on

luin

dex

luse

arc

hfix

pm

d

sunflow

xala

n

pjb

b2

00

5

com

pre

ss

33

Total Time v Heap Size

RCImmix matches GenImmix at 1.3x and outperforms from 1.4x

1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 61

1.05

1.1

1.15

1.2

1.25

1.3

1.35

1.4

1.45

1.5

GenImmix RC RC Immix RC Immix (No PC)

Heap Size / Minimum Heap

Tim

e /

Be

st

34

Summary and Conclusion

• RC Immix – Combines RC and Immix

• Great performance– Outperforms fastest production

• Transforms RC

Questions?Series1

9%

Tota

l Tim

e v

Pro

du

cti

on

RC 2013

RC Immix

-3%

Available at: http://jira.codehaus.org/browse/RVM-1061


Recommended