Software Transactional Memory for Large Scale Clusters
Robert L. Bocchino Jr. and Vikram S. Adve, University of Illinois at Urbana-Champaign
Bradford L. Chamberlain, Cray Inc.
Transactional Memory (TM)
Can simplify parallel programming
Well studied for small-scale, cache-coherent platforms
No prior work on TM for large scale platforms
• Potentially thousands of processors
• Distributed memory, no cache coherence
• Slow communication between nodes
Why STM On Clusters?
TM is a natural fit for PGAS Languages
• UPC, CAF, Titanium, Chapel, Fortress, X10, ...
• Address space is global (unlike message passing)
• But data distribution is explicit (unlike cc-NUMA, DSM)
Commodity clusters are in widespread use
Software transactional memory (STM) is natural choice
• Communication done in software anyway
• Could leverage hardware TM support if it exists
What’s New About STM on Clusters?

                                               Classic STM        Cluster STM
  Read and write                               Words of data      Blocks of data
  Primary overhead                             Extra scalar ops   Extra remote ops
  Heap address space                           Uniform            Partitioned
  STM metadata                                 Uniform            Distributed
  Distributing computation for data locality   N/A                on p { ... }
Research Contributions
First STM designed for high performance on large clusters
• Block data movement
• Computation migration
• Distributed metadata
Experimental evaluation of prototype
• Performance vs. locks
• New design tradeoffs
Decomposition of STM design space into eight axes
Outline
Interface Design
Algorithm Design
Evaluation
• Cluster STM vs. Manual Locking
• Read Locks vs. Read Validation
Conclusion
Interface Design

Compiler target, not primarily for programmers
Correct use guarantees serializability of transactions

Chapel:

  atomic {
    // Array assignment
    cache = A;
    compute(cache);
    A = cache;
  }

Cluster STM API (generated by the compiler):

  stm_start(...);
  stm_get(&cache, &A, ...);
  compute(&cache);
  stm_put(&A, &cache, ...);
  stm_commit(...);
Interface Summary
Transaction start and commit
Transactional memory allocation
Block data movement to and from transactional store
Remote execution of transactional computations
Block Data Movement

A lives on p2 (non-tx); buffers B and C live on p1 (tx); all ops occur on p1:

  stm_get(work_proc=p2, src=A, dest=B, size=n, ...)
  stm_put(work_proc=p2, src=B, dest=A, size=n, ...)
  stm_read(src=C, dest=B, size=n, ...)
  stm_write(src=B, dest=C, size=n, ...)
Remote Work for Exploiting Locality

Chapel: on p2 { f(...); }

  stm_on(work_proc=p2, function=f, ...)

Instead of p1 issuing remote puts & gets, the computation migrates to p2 and performs local reads & writes there.
Remote Work Inside Transaction

atomic { on p2 { f(...); } }

On p1:
  stm_start(src_proc=p1)
  stm_on(src_proc=p1, work_proc=p2, function=f, ...)
  stm_commit(src_proc=p1)   (single commit message)

On p2 (local ops):
  stm_read(src_proc=p1)
  stm_write(src_proc=p1)
Transaction Inside Remote Work

on p2 { atomic { f(...); } }

On p1:
  stm_on(src_proc=p1, work_proc=p2, function=f, ...)

On p2 (all local ops):
  stm_start(src_proc=p1)
  stm_read(src_proc=p1)
  stm_write(src_proc=p1)
  stm_commit(src_proc=p1)
Outline
Interface Design
Algorithm Design
Evaluation
• Cluster STM vs. Manual Locking
• Read Locks vs. Read Validation
Conclusion
Algorithm Design
Shared metadata
Transaction-local metadata
STM design choices
Shared Metadata

Some metadata must be visible to all transactions
• Read and write locks
• Validation ID

Memory overhead compromise
• Each metadata word guards one conflict detection unit (CDU)
• CDU size is s ≥ 1 words
• s > 1 may introduce false sharing

Keep metadata on same processor as corresponding CDU
Transaction-Local Metadata

On p1: stm_on(src_proc=p1), stm_start(src_proc=p1), stm_commit(src_proc=p1)
On p2: stm_read(src_proc=p1), stm_write(src_proc=p1)

• p2 stores tx metadata for p1
• p2 uses stored metadata to commit on behalf of p1

Keep metadata local to processor where access occurred
STM Design Choices

Eight-axis design space (see paper)
Choose four axes to explore
• CDU size
• Read locks (RL) vs. read validation (RV)
• Undo log (UL) vs. write buffer (WB)
• Early acquire (EA) vs. late acquire (LA)

Some tradeoffs come out differently on clusters
Discuss RL vs. RV here
See paper for additional results
Outline
Interface Design
Algorithm Design
Evaluation
• Cluster STM vs. Manual Locking
• Read Locks vs. Read Validation
Conclusion
Evaluation
Benchmarks
• Microbenchmarks: Intset, Hashmap Swap
• Graph clustering: SSCA 2 Kernel 4
Machine
• Intel Xeon cluster
• Two cores per node
• Myrinet
Outline
Interface Design
Algorithm Design
Evaluation
• Cluster STM vs. Manual Locking
• Read Locks vs. Read Validation
Conclusion
Intset Results

[Chart: time in seconds (log scale) vs. number of processors, 1 to 512 (log scale); series: Locks and STM at 6M operations, Locks and STM at 100M operations]

• Switching problem sizes at p = 8
• Bump at p < 4
• Good scaling after p = 4; locks and STM about the same
SSCA 2 Results

[Chart: time in seconds (log scale) vs. number of processors, 1 to 512 (log scale); series: locks and STM]

• Bump is smaller
• Good scaling after p = 4; STM overhead about 2.5x
Outline
Interface Design
Algorithm Design
Evaluation
• Cluster STM vs. Manual Locking
• Read Locks vs. Read Validation
Conclusion
Implementation

Read locks (RL)
• Metadata word holds write bit and reader count
• Abort on attempt to read or write conflicting CDU

Read validation (RV)
• Metadata word holds write bit and validation ID
• Increment ID with each write
• Abort at commit if ID changed after CDU was read

Validation requires extra remote op at commit
No global cache ⇒ no helpful cache effects
Ratio of RV to RL Runtimes

[Chart: ratio of RV to RL runtime (0 to 4) vs. number of processors, 1 to 512; series: Intset and SSCA 2]
Results Summary
Good scalability to p = 512
Good performance
• Nearly identical to locks on microbenchmarks
• Some STM overhead for SSCA 2
Different design tradeoffs
• RL outperforms RV
• EA outperforms LA (see paper)
• Minimal penalty for WB (see paper)
Outline
Interface Design
Algorithm Design
Evaluation
Conclusion
Conclusion
Presented design and evaluation of Cluster STM
• First STM for high performance on large scale clusters
• Good performance, scaling to p = 512
• New evaluation of design tradeoffs
Future work
• Exploiting shared memory within a node
• Nested parallelism in a transaction
• Dynamic spawning of threads
Thank You