
Improving Locality with Parallel Hierarchical Copying GC

David Siegwart, IBM Software Group
Martin Hirzel, IBM Watson Research Center

ISMM’06, Ottawa, Ontario, Canada | June 10th 2006 | © 2006 IBM Corporation


Talk Summary

Motivation

Background & Related Work

Hierarchical Copying GC, Parallelized.

Evaluation across a wide range of benchmarks.

Conclusions


Motivation

Improving Locality:

– Commercial workloads spend 45% of their time stalled on memory requests [Adl-Tabatabai et al., PLDI’04: SPECjbb2000 on Itanium II].

– Object order in memory influences misses.

– Copying GC can relocate objects, changing object ordering.

– Objective: co-locate objects that are used together, on the same page or cache line.

Maintaining Scalability:

– Parallelism and workload balancing are essential for server workloads.


Related Objects are Used Together

Looked at consecutive field accesses:

– siblings

– child-parent

For SPECjbb2005:

– 29% siblings

– 14% child-parent

For a Trade6 primitive (J2EE benchmark):

– 36% siblings

– 8% child-parent

Copying GC should have:

– good locality for siblings

– good locality for child-parent


Background

[Figure: timeline of copying GC algorithms. Nodes: Cheney, Moon, Wilson/Lam/Moher, Halstead, Imai/Tick, Parallel Hierarchical. Years shown: 1970, 1984, 1985, 1992, 1993, 2006. Edge labels: + parallel, + load balancing, + hierarchical, – rescanning.]


Cheney Copying GC – Good for Siblings

[Figure: a binary tree of objects (o1 at the root; o2, o3; o4 to o7; o8 to o15) copied breadth-first into to-space, animated through successive positions of the scan and free pointers. Legend: parent, child, copied, copied & scanned.]
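The breadth-first behaviour of a Cheney-style collector can be sketched as follows. This is a minimal single-threaded illustration, not the J9 implementation: objects are plain integers, `tree` is an assumed encoding of the o1 to o15 example, and the names `cheney_order`, `scan`, and `forwarded` are chosen here for illustration. The list length plays the role of the free pointer.

```python
def cheney_order(children, root):
    """Return objects in the order a Cheney-style scan copies them."""
    to_space = [root]      # to_space[:scan] is "copied & scanned"
    forwarded = {root}     # objects that already have a to-space copy
    scan = 0               # scan pointer; len(to_space) acts as the free pointer
    while scan < len(to_space):
        for child in children.get(to_space[scan], ()):
            if child not in forwarded:
                forwarded.add(child)
                to_space.append(child)   # copy at the free pointer
        scan += 1
    return to_space


# Assumed encoding of the example tree: object i has children 2i and 2i+1.
tree = {i: (2 * i, 2 * i + 1) for i in range(1, 8)}
order = cheney_order(tree, 1)
print(order)
```

In this sketch the siblings o4 to o7 end up adjacent in to-space, while each parent sits progressively further from its children.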


Cheney Copying GC – Bad for Parent-Child (SPECjbb2005)

[Figure: histogram for Cheney (breadth-first). x-axis: scanned slot to copied object distance (log2), 1 to 26; y-axis: proportion, 0% to 30%. Markers: 64-byte cache line, page size (4 kB).]

– Increases working set, hence TLB misses and L2 cache misses
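A distance histogram of this kind can be sketched as below. The sketch makes simplifying assumptions that are not in the slides: distances are measured in object positions rather than bytes, every object has the same size, and the layout is the breadth-first order of a 15-node binary tree. The function name `distance_histogram` is illustrative.

```python
from collections import Counter


def distance_histogram(order, children):
    """Proportion of parent-child pairs per floor(log2(distance)) bucket."""
    pos = {obj: i for i, obj in enumerate(order)}
    buckets = Counter()
    for parent, kids in children.items():
        for kid in kids:
            d = abs(pos[kid] - pos[parent])   # distance in object positions, >= 1
            buckets[d.bit_length() - 1] += 1  # floor(log2(d)) without floats
    total = sum(buckets.values())
    return {b: n / total for b, n in sorted(buckets.items())}


# Assumed example: object i has children 2i and 2i+1, laid out breadth-first.
tree = {i: (2 * i, 2 * i + 1) for i in range(1, 8)}
breadth_first = list(range(1, 16))
hist = distance_histogram(breadth_first, tree)
print(hist)
```

Even on this small tree most parent-child pairs fall into the larger buckets, which is the effect the histogram shows at scale.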


Depth-First Copying – Good for Parent-Child

[Figure: the binary tree of objects o1 to o15 (o1 at the root; o2, o3; o4 to o7; o8 to o15) copied depth-first.]

– Bad for Siblings (o4, o5, o6, o7 are on separate pages)
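For comparison, a depth-first copy order can be sketched with an explicit stack. As before, this is an illustration under assumptions: integer object ids, the assumed `tree` encoding of the o1 to o15 example, and a hypothetical function name `depth_first_order`.

```python
def depth_first_order(children, root):
    """Return objects in the order a depth-first copy visits them."""
    to_space = []
    forwarded = {root}
    stack = [root]
    while stack:
        obj = stack.pop()
        to_space.append(obj)
        # Push children in reverse so the first reference slot is copied first.
        for child in reversed(children.get(obj, ())):
            if child not in forwarded:
                forwarded.add(child)
                stack.append(child)
    return to_space


# Assumed encoding of the example tree: object i has children 2i and 2i+1.
tree = {i: (2 * i, 2 * i + 1) for i in range(1, 8)}
order = depth_first_order(tree, 1)
print(order)
print([order.index(o) for o in (4, 5, 6, 7)])
```

Each parent is now followed directly by its first child, but the siblings o4, o5, o6, o7 are spread across the layout, each separated from the next by a whole subtree.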


Moon’s Hierarchical Copying GC

[Figure: animation of Moon's algorithm copying the tree o1 to o15 into to-space blocks A, B, C, D, E, showing the scan, partial, and free pointers; part of to-space is marked "re-scanned".]

Two scan pointers: scan, partial


Wilson, Lam & Moher’s Hierarchical Copying GC

[Figure: animation of hierarchical copying of the tree o1 to o15 into to-space blocks A, B, C, D, E. Each block has its own scan pointer, and the free pointer marks the copy block. Frames are labelled "scan block = copy block" or "scan block ≠ copy block".]

– A scan pointer in each block avoids re-scanning.

– Aliasing the scan block to the copy block reduces copy-scan distances.


Imai and Tick’s Parallel Copying GC

[Figure: to-space divided into blocks, with a shared work pool serving Thread 1, Thread 2, ..., Thread n. For each thread, either scan block ≠ copy block, or scan block = copy block (aliased).]


Recognising the Connection

[Figure: the Imai & Tick diagram again (work pool, Thread 1, Thread 2; scan block ≠ copy block, scan block = copy block (aliased)), set beside Wilson, Lam & Moher.]

Wilson, Lam & Moher (WLM): hierarchical, not parallel

Imai & Tick: parallel, not hierarchical

The immediacy of aliasing in WLM is what distinguishes it from Imai & Tick.

So immediate aliasing in Imai & Tick gives hierarchical copying.

Need to increase aliasing in Imai & Tick to improve locality.


Immediate Aliasing

Check for aliasing opportunity immediately after each reference slot in each object has been scanned.

Interrupt scanning at this point, and restart with the aliased block

Easier to see via transition diagram
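The slot-by-slot aliasing check can be sketched single-threaded as below. This is a simplified illustration, not the parallel J9 collector: it assumes unit-size objects, fixed-capacity blocks, no work pool or threads, and the assumed `tree` encoding of the o1 to o15 example. The names `hierarchical_copy`, `pending`, and `block_size` are hypothetical.

```python
def hierarchical_copy(children, root, block_size):
    """Copy order when scanning switches to the copy block after every slot."""
    to_space = [root]
    slot = {root: 0}   # next reference slot to scan, per copied object

    def pending(block):
        """Position of the first object in `block` with unscanned slots, or None."""
        lo = block * block_size
        for pos in range(lo, min(lo + block_size, len(to_space))):
            obj = to_space[pos]
            if slot[obj] < len(children.get(obj, ())):
                return pos
        return None

    scan_block = 0
    while True:
        copy_block = len(to_space) // block_size   # block holding the free pointer
        # Prefer the copy block (aliasing), then the current scan block,
        # then any other block that still has unscanned slots.
        for block in [copy_block, scan_block, *range(copy_block + 1)]:
            pos = pending(block)
            if pos is not None:
                break
        else:
            return to_space                        # nothing left to scan
        scan_block = block
        obj = to_space[pos]
        child = children[obj][slot[obj]]           # scan exactly one slot
        slot[obj] += 1
        if child not in slot:
            slot[child] = 0
            to_space.append(child)                 # copy at the free pointer


# Assumed encoding of the example tree: object i has children 2i and 2i+1.
tree = {i: (2 * i, 2 * i + 1) for i in range(1, 8)}
layout = hierarchical_copy(tree, 1, block_size=3)
print([layout[i:i + 3] for i in range(0, len(layout), 3)])
```

With three objects per block, each block in this sketch holds a parent together with its two children, so both sibling and parent-child pairs stay close.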


Parallel Hierarchical – Block State Transitions

[Figure: block state-transition diagram with states freelist, copy, aliased, scan, scanlist, done. Legend: shared data.]


Parent-Child Distances for Parallel Hierarchical (SPECjbb2005)

[Figure: histogram comparing Breadth-First and Hierarchical. x-axis: scanned slot to copied object distance (log2), 1 to 26; y-axis: proportion, 0% to 30%. Markers: 64-byte cache line, page size (4 kB).]

– Fewer TLB misses, fewer L2 cache misses


Baseline GC

IBM J9 JVM; the GC has two generations:

Parallel copying for the young generation:

– two semi-spaces

– most GCs are of this type

Concurrent mark for the old generation:

– stop-the-world phase (rare, compared to young collection)


Results – 26 Benchmark Suite

[Figure: bar chart of % speedups (1 - PH/BF), y-axis from -10% to 25%, for SPECjbb2005, db, javasrc, mtrt, jbytemark, javac, chart, jpat, banshee, javalex, jython, eclipse, mpegaudio, compress, fop, hsqldb, kawa, soot, batik, jack, antlr, jess, ps, bloat, pmd, ipsixql.]

Heap size 10x min, except SPECjbb2005
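The chart's metric, % speedup = 1 - PH/BF, can be written out as a small helper. The sketch assumes that PH and BF are run times under the parallel hierarchical and breadth-first collectors respectively; the slides give only the formula, and the numbers below are made-up inputs, not measured results.

```python
def percent_speedup(ph_time, bf_time):
    """Speedup of parallel hierarchical (PH) over breadth-first (BF), in percent."""
    return 100.0 * (1.0 - ph_time / bf_time)


# Hypothetical timings, for illustration only.
print(percent_speedup(75.0, 100.0))    # PH faster: positive speedup
print(percent_speedup(110.0, 100.0))   # PH slower: negative speedup
```

A positive value means the hierarchical collector gave the shorter run; a negative bar in the chart corresponds to a slowdown.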


Results – Scalability SPECjbb2005

Platform: Windows 2000 Advanced Server 5.0.2195 SP4; 4x (1.6 GHz HT Pentium 4 Xeon); 256 kB L2 (64-byte cache line), 1 MB L3, 2 GB RAM

Base Build: J9 5.0 GA pwi32dev-20051104

[Figure: line chart of throughput (0 to 5) against warehouses (0 to 16). Series: Hierarchical, Breadth-First.]


GC Scaling – SPECjbb2005

Platform: Windows 2000 Advanced Server 5.0.2195 SP4; 4x (1.6 GHz HT Pentium 4 Xeon); 256 kB L2 (64-byte cache line), 1 MB L3, 2 GB RAM

Base Build: J9 5.0 GA pwi32dev-20051104

[Figure: line chart of normalized transactions / (GC time) (0 to 4) against GC threads (0 to 8). Series: Breadth-First, Hierarchical.]


Mutator vs Collector - db

Platform: Linux; 1x (3.06 GHz HT Pentium 4 Xeon); 512 kB L2 (64-byte cache line), 1 GB RAM

Base Build: J9 5.0 GA pxi32dev-20051104

[Figure, panel "Mutator Time": normalized mutator time (1 to 1.5) against heap size relative to minimum heap size (1 to 10). Series: Hierarchical, Breadth-First.]

[Figure, panel "GC Time": normalized GC time (1 to 3) against heap size relative to minimum heap size (1 to 10). Series: Hierarchical, Breadth-First.]


Cache & TLB Misses - db

Platform: Linux; 1x (3.06 GHz HT Pentium 4 Xeon); 512 kB L2 (64-byte cache line), 1 GB RAM

Base Build: J9 5.0 GA pxi32dev-20051104

[Figure, first panel: normalized mutator L1 cache misses (1 to 1.5) against heap size relative to minimum heap size (1 to 10). Series: Hierarchical, Breadth-First.]

[Figure, second panel: normalized mutator TLB misses (1 to 1.5) against heap size relative to minimum heap size (1 to 10). Series: Hierarchical, Breadth-First.]


Conclusions

Introduced a new algorithm:

– Improves memory locality

– Maintains good scalability

Two technologies in one: hierarchical decomposition and parallel copying GC.

Requires no online profiling.

Evaluated across a wide range of benchmarks:

– Better locality: a dramatic reduction in TLB misses, and also fewer L1 misses.

– The cost on the collector is outweighed by the benefit to the mutator.

– The majority of benchmarks show improvements.


Backup


Related Work

[Figure: related work arranged by memory-hierarchy level (L1, L2, TLB, Paging), language (C/C++, Java, Lisp), and mechanism (OS, Allocator, Prefetching, Moving GC). Works shown: Ch./La. ’98, Huang ’04, Shuf ’02, Adl-T. ’04, Lattner ’04, La./Ad. ’05, Ch./Hi. ’01, Cascaval ’05, Moon ’84, Kistler/Fra. ’03, Wi/La/Mo. ’91.]


Results – 26 Benchmark Suite – other heap sizes

[Figure: bar chart of % speedups (1 - PH/BF), y-axis from -10% to 25%, for SPECjbb2005, db, javasrc, mtrt, jbytemark, javac, chart, jpat, banshee, javalex, jython, eclipse, mpegaudio, compress, fop, hsqldb, kawa, soot, batik, jack, antlr, jess, ps, bloat, pmd, ipsixql. Series by heap size relative to minimum: 1.33x, 2x, 4x, 10x.]