A Mostly Non-Copying Real-Time Collector with Low Overhead and Consistent Utilization
David Bacon, Perry Cheng (presenting), V.T. Rajan
IBM T.J. Watson Research
Roadmap
- What is Real-time Garbage Collection? Pause time, CPU utilization (MMU), and space usage
- Heap Architecture: types of fragmentation, incremental compaction, read barriers, barrier performance
- Scheduling: Time-Based vs. Work-Based
- Empirical Results: pause time distribution, minimum mutator utilization (MMU), pause times
- Summary and Conclusion
Problem Domain
- Real-time embedded systems: memory usage is important
- Uniprocessor
3 Styles of Uniprocessor Garbage Collection: Stop-the-World vs. Incremental vs. Real-Time
[Diagram: timeline of mutator and collector activity for STW, Inc, and RT]
Pause Times (Average and Maximum)
[Chart: pause-time timelines. STW: pauses around 1.5-1.7 s (avg ~1.6 s); Inc: pauses of 0.3-0.9 s (avg ~0.5 s); RT: pauses of 0.15-0.19 s (avg ~0.18 s)]
Coarse-Grained Utilization vs. Time
[Chart: utilization (%) vs. time (s) over 0-8 s for STW, Inc, and RT, measured with a 2.0 s window]
Fine-Grained Utilization vs. Time
[Chart: utilization vs. time (s) over 0-8 s for STW, Inc, and RT, measured with a 0.4 s window]
Minimum Mutator Utilization (MMU)
[Chart: MMU (%) vs. window size (s, logarithmic scale) for STW, Inc, and RT]
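MMU can be computed from a trace of collector pauses. Here is a minimal sketch (the class and trace format are illustrative, not from the talk): for a window size w, MMU(w) is the smallest fraction of any length-w window during which the mutator, rather than the collector, was running. Since total pause time in a sliding window is piecewise linear in the window's start, the maximum occurs where a window edge aligns with a pause edge, so checking those candidate starts suffices.

```java
// Sketch: computing minimum mutator utilization from a pause trace.
// pauses[i] = {start, end} of a collector pause, in seconds.
public class Mmu {
    // Collector time that pause [s, e) contributes to window [t, t + w).
    static double overlap(double s, double e, double t, double w) {
        return Math.max(0.0, Math.min(e, t + w) - Math.max(s, t));
    }

    public static double mmu(double[][] pauses, double w, double traceEnd) {
        double maxPause = 0.0;
        for (double[] p : pauses) {
            // Candidate window starts: each pause edge, and each pause edge
            // minus w (so the window's far edge aligns with the pause edge).
            double[] candidates = {p[0], p[1], p[0] - w, p[1] - w};
            for (double t : candidates) {
                t = Math.max(0.0, Math.min(t, traceEnd - w));
                double inWindow = 0.0;
                for (double[] q : pauses) {
                    inWindow += overlap(q[0], q[1], t, w);
                }
                maxPause = Math.max(maxPause, inWindow);
            }
        }
        return 1.0 - maxPause / w;
    }
}
```

For example, a single 0.2 s pause with a 0.4 s window gives MMU = 0.5: the worst window spends half its time in the collector.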
Space Usage over Time
[Chart: used space (MB) vs. time (s) over 0-8 s for STW, Inc, and RT, with reference lines at max live, the collection trigger, and 2 X max live]
Problems with Existing RT Collectors

Non-moving Collector
[Chart: space (MB) vs. time (s), with reference lines at max live, 2 X, 3 X, and 4 X max live]
Replicating Collector
[Charts: MMU (%) vs. time (s), and space (MB) vs. time (s) with reference lines at max live, 2 X, 3 X, and 4 X max live]
- Not fully incremental
- Tight coupling
- Work-based scheduling
Our Collector: Goals and Results
- Real-Time: ~10 ms
- Low Space Overhead: ~2X
- Good Utilization during GC: ~40%

Solution
- Incremental mark-sweep collector
- Write barrier: snapshot-at-the-beginning [Yuasa]
- Segregated free-list heap architecture
- Read barrier to support defragmentation [Brooks]
- Incremental defragmentation
- Segmented arrays to bound fragmentation
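The snapshot-at-the-beginning write barrier cited above can be sketched in a few lines (names here are hypothetical, and a real collector would operate on raw heap slots rather than a single `next` field): before a reference field is overwritten while a collection is in progress, the old value is recorded, so every object reachable at the start of the collection still gets traced.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a Yuasa-style snapshot-at-the-beginning write barrier.
public class YuasaBarrier {
    public static class Node { public Node next; }

    public static boolean collecting = false;
    public static final Deque<Node> markStack = new ArrayDeque<>();

    // Barrier applied to every reference store, e.g. "x.next = y".
    public static void writeNext(Node x, Node y) {
        if (collecting && x.next != null) {
            markStack.push(x.next); // remember the overwritten snapshot value
        }
        x.next = y;
    }
}
```

The cost is paid only on reference stores during a collection, which is why the talk pairs this cheap write barrier with a read barrier only where defragmentation demands it.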
Fragmentation and Compaction
- Intuitively: available but unusable memory
- Avoidance and coalescing: no guarantees
- Compaction
[Diagram: used, needed, and free memory]
Heap Architecture: Segregated Free Lists
- Heap divided into pages
- Each page has equally-sized blocks (1 object per block)
- Large arrays are segmented
[Diagram: pages of used and free blocks (e.g. sizes 24 and 32), illustrating external, internal, and page-internal fragmentation]
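Segmented ("arraylet") access can be sketched as follows; the class name and the segment size of 256 elements are assumptions for illustration, not the talk's parameters. A large array is split into fixed-size segments reached through a spine of segment pointers, so no large contiguous region is ever needed and array space cannot become externally fragmented.

```java
// Sketch of a segmented array: index the spine, then the segment.
public class SegmentedArray {
    static final int SEG = 256;        // elements per segment (assumed)
    public final int[][] spine;        // pointers to fixed-size segments
    public final int length;

    public SegmentedArray(int length) {
        this.length = length;
        spine = new int[(length + SEG - 1) / SEG][];
        for (int i = 0; i < spine.length; i++) {
            // Last segment may be shorter than SEG.
            spine[i] = new int[Math.min(SEG, length - i * SEG)];
        }
    }

    public int get(int i)         { return spine[i / SEG][i % SEG]; }
    public void set(int i, int v) { spine[i / SEG][i % SEG] = v; }
}
```

The extra indirection on every access is the "segmented array cost" the talk's future-work slide aims to reduce below ~2%.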
Controlling Internal and Page-Internal Fragmentation
- Choose the page size Π and the block sizes s_k
- If s_k = s_{k-1}(1 + ρ), internal fragmentation is bounded by ρ
- Page-internal fragmentation is bounded by s_max / Π
- E.g. if Π = 16 KB, ρ = 1/8, and s_max = 2 KB, maximum non-external fragmentation is bounded to 12.5%
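A minimal sketch of the geometric size-class construction (illustrative names, not the talk's exact code): growing each block size by a factor (1 + ρ) means that rounding an object up to the next class can never waste more than a fraction ρ of its block, which is exactly the internal-fragmentation bound above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: size classes s_k = ceil(s_{k-1} * (1 + rho)).
public class SizeClasses {
    public static int[] classes(int sMin, int sMax, double rho) {
        List<Integer> sizes = new ArrayList<>();
        int s = sMin;
        while (s < sMax) {
            sizes.add(s);
            // Round up, and always advance by at least one byte.
            s = Math.max(s + 1, (int) Math.ceil(s * (1.0 + rho)));
        }
        sizes.add(sMax);
        int[] out = new int[sizes.size()];
        for (int i = 0; i < out.length; i++) out[i] = sizes.get(i);
        return out;
    }

    // Worst case: an object one byte larger than a class gets the next
    // class; the wasted fraction of that block never exceeds rho.
    public static double worstInternalFrag(int[] sizes) {
        double worst = 0.0;
        for (int i = 1; i < sizes.length; i++) {
            worst = Math.max(worst, 1.0 - (sizes[i - 1] + 1.0) / sizes[i]);
        }
        return worst;
    }
}
```

With ρ = 1/8 and a 2 KB maximum block size, the computed worst case stays under the 12.5% bound quoted on the slide.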
Fragmentation: small heap (ρ = 1/8 vs. ρ = 1/2)
[Chart: per-benchmark (compress, db, jack, javac, jess, mpegaudio, mtrt) breakdown of heap space into Live, Recently Dead, Internal, Page-Internal, and External fragmentation, for ρ = 1/8 and ρ = 1/2]
Incremental Compaction
- Compact only a part of the heap
- Requires knowing what to compact ahead of time
- Key problems: popular objects; determining references to moved objects
Incremental Compaction: Redirection
- Access all objects via per-object redirection pointers
- Redirection is initially self-referential
- Move an object by updating ONE redirection pointer
[Diagram: original and replica objects linked by the redirection pointer]
Consistency via Read Barrier [Brooks]
- Correctness requires always using the replica
- E.g. field selection must be modified:
  - normal access: x[offset]
  - read barrier access: x[redirect][offset]
Some Important Details
- Our read barrier is decoupled from collection
- Complication: in Java, any reference might be null, so the actual read barrier for GetField(x, offset) must be augmented:
  tmp = x[offset]; return (tmp == null) ? null : tmp[redirect]
- Optimizations: CSE, code motion (LICM and sinking), null-check combining
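The redirection scheme and the null-checked barrier can be sketched together (class and field names here are hypothetical, and a real VM would emit the barrier inline rather than as method calls): every object carries a `redirect` reference that initially points to the object itself; the collector moves an object by copying it and flipping that ONE pointer, and every access follows it first.

```java
// Sketch of a Brooks-style forwarding pointer with an eager read barrier.
public class Brooks {
    public static class Obj {
        public Obj redirect = this; // self-referential until moved
        public int value;
        public Obj ref;             // a single reference field, for brevity
    }

    // Barriers: always follow the redirection pointer first.
    public static int readValue(Obj x)          { return x.redirect.value; }
    public static void writeValue(Obj x, int v) { x.redirect.value = v; }

    // Eager variant of the reference-load barrier: the loaded value may be
    // null, so it must be checked before it is redirected (cf. the GetField
    // barrier on the slide).
    public static Obj readRef(Obj x) {
        Obj tmp = x.redirect.ref;
        return (tmp == null) ? null : tmp.redirect;
    }

    // The collector evacuates with a copy plus one pointer flip.
    public static Obj evacuate(Obj original) {
        Obj replica = new Obj();
        replica.value = original.value;
        replica.ref = original.ref;
        original.redirect = replica; // all future accesses reach the replica
        return replica;
    }
}
```

This is the eager variant from the next slide: loads return already-redirected references, which gives the optimizer more freedom; the lazy variant would defer following `redirect` until use.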
Barrier Variants: when to redirect
- Lazy: easier for the collector
- Eager: better for optimization
Barrier Overhead to Mutator
- Conventional wisdom says read barriers are too expensive
- Studies found overheads of 20-40% (Zorn, Nielsen)
- Our barrier has 4-6% overhead with optimizations
[Chart: read-barrier overhead to the mutator (%) for each benchmark (compress, jess, db, javac, mpegaudio, mtrt, jack) and the geometric mean, comparing the Lazy and Eager variants]
[Diagram sequence: heap (one block size only) and stack through the collector's phases]
1. Program start: heap entirely free
2. Program is allocating: free and allocated blocks
3. GC starts: allocated blocks become unmarked
4. Program allocating and GC marking: free, unmarked, and marked-or-allocated blocks
5. Sweeping away blocks: unmarked blocks are reclaimed
6. GC moving objects and installing redirection: free, allocated, and evacuated blocks
7. 2nd GC starts tracing and redirection fixup: free, unmarked, evacuated, and marked-or-allocated blocks
8. 2nd GC complete: only free and allocated blocks remain
Scheduling the Collector
Scheduling issues:
- Bad scheduling leads to poor CPU utilization and space usage
- Loose coupling between program and collector
Two policies:
- Time-Based: trigger the collector to run for C_T seconds whenever the program runs for Q_T seconds
- Work-Based: trigger the collector to do C_W work whenever the program allocates Q_W bytes
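The two trigger policies can be stated as formulas (the class and method names below are illustrative). Time-based scheduling hands out collector time per unit of mutator time, so mutator utilization over each quantum is Q_T / (Q_T + C_T) regardless of the allocation rate; work-based scheduling hands out collector work per byte allocated, so a fast-allocating phase gets heavily throttled.

```java
// Sketch: how much collector time/work each policy has accumulated.
public class GcSchedule {
    // Time-based: C_T seconds of collection per Q_T seconds of mutator time.
    public static double collectorTimeOwed(double mutatorSeconds, double qT, double cT) {
        return Math.floor(mutatorSeconds / qT) * cT;
    }

    // Guaranteed mutator utilization of the time-based policy per quantum.
    public static double timeBasedUtilization(double qT, double cT) {
        return qT / (qT + cT);
    }

    // Work-based: C_W units of collector work per Q_W bytes allocated.
    public static double collectorWorkOwed(long bytesAllocated, long qW, double cW) {
        return (bytesAllocated / qW) * cW;
    }
}
```

The next two slides show the consequence: time-based scheduling makes utilization predictable but lets space vary with allocation behavior, while work-based scheduling does the reverse.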
Time-Based Scheduling
Trigger the collector to run for C_T seconds whenever the program runs for Q_T seconds
[Charts: space (MB) vs. time (s) varies with the allocation pattern (Smooth, Uneven, High); MMU (CPU utilization) vs. window size is the same for any allocation pattern]
Work-Based Scheduling
Trigger the collector to collect C_W bytes whenever the program allocates Q_W bytes
[Charts: MMU (CPU utilization) vs. window size varies with the allocation pattern (Smooth, Uneven, High); space (MB) vs. time (s) is the same for any allocation pattern]
Pause Time Distribution for javac (Time-Based vs. Work-Based)
[Charts: pause-time histograms for both policies, each annotated at 12 ms]
Utilization vs. Time for javac (Time-Based vs. Work-Based)
[Charts: utilization (%) vs. time (s) for both policies; the time-based policy stays above 0.45]
Minimum Mutator Utilization for javac (Time-Based vs. Work-Based)

Space Usage for javac (Time-Based vs. Work-Based)
Intrinsic Tradeoff
3 inter-related factors:
- Space bound (tradeoff)
- Utilization (tradeoff)
- Allocation rate (lower is better)
Other factors:
- Collection rate (higher is better)
- Pointer density (lower is better)
Summary: Mostly Non-moving RT GC
- Read barriers permit incremental defragmentation; overhead is 4-6% with compiler optimizations
- Low space overhead: space usage is only about 2 X max live data, and fragmentation is still bounded
- Consistent utilization: always at least 45% at 12 ms resolution
Conclusions
- Real-time GC is real
- There are tradeoffs, just like in traditional GC
- Scheduling should be primarily time-based, falling back to work-based when the user's parameter estimates are wrong
- Incremental defragmentation is possible
- Compiler support is important!
Future Work
- Lowering the real-time resolution: sub-millisecond worst-case pauses; the main issue is breaking up the stack scan
- Segmented array optimizations: reduce the segmented-array cost to below ~2%; opportunistic contiguous layout; type-based specialization with invalidation; strip-mining