Software Transactional Memory for Large Scale Clusters
Robert L. Bocchino Jr. and Vikram S. Adve, University of Illinois at Urbana-Champaign
Bradford L. Chamberlain, Cray Inc.
Transactional Memory (TM)
Can simplify parallel programming
Well studied for small-scale, cache-coherent platforms
No prior work on TM for large scale platforms
• Potentially thousands of processors
• Distributed memory, no cache coherence
• Slow communication between nodes
Why STM On Clusters?
TM is a natural fit for PGAS Languages
• UPC, CAF, Titanium, Chapel, Fortress, X10, ...
• Address space is global (unlike message passing)
• But data distribution is explicit (unlike cc-NUMA, DSM)
Commodity clusters are in widespread use
Software transactional memory (STM) is natural choice
• Communication done in software anyway
• Could leverage hardware TM support if it exists
What’s New About STM on Clusters?

                                               Classic STM        Cluster STM
  Read and write                               Words of data      Blocks of data
  Primary overhead                             Extra scalar ops   Extra remote ops
  Heap address space                           Uniform            Partitioned
  STM metadata                                 Uniform            Distributed
  Distributing computation for data locality   N/A                on p { ... }
Research Contributions
First STM designed for high performance on large clusters
• Block data movement
• Computation migration
• Distributed metadata
Experimental evaluation of prototype
• Performance vs. locks
• New design tradeoffs
Decomposition of STM design space into eight axes
Outline
Interface Design
Algorithm Design
Evaluation
• Cluster STM vs. Manual Locking
• Read Locks vs. Read Validation
Conclusion
Interface Design

Compiler target, not primarily for programmers
Correct use guarantees serializability of transactions

Chapel:

  atomic {
    // Array assignment
    cache = A;
    compute(cache);
    A = cache;
  }

Cluster STM API (generated by the compiler):

  stm_start(...);
  stm_get(&cache, &A, ...);
  compute(&cache);
  stm_put(&A, &cache, ...);
  stm_commit(...);
Interface Summary
Transaction start and commit
Transactional memory allocation
Block data movement to and from transactional store
Remote execution of transactional computations
Block Data Movement

A lives on p2 (non-tx); buffers B and C live on p1 (tx); all ops occur on p1:

  stm_get(work_proc=p2, src=A, dest=B, size=n, ...)
  stm_put(work_proc=p2, src=B, dest=A, size=n, ...)
  stm_read(src=C, dest=B, size=n, ...)
  stm_write(src=B, dest=C, size=n, ...)
Remote Work for Exploiting Locality

Chapel: on p2 { f(...); }

  stm_on(work_proc=p2, function=f, ...)

Instead of p1 issuing remote puts & gets, the computation migrates to p2 and performs local reads & writes there.
Remote Work Inside Transaction

atomic { on p2 { f(...); } }

On p1:
  stm_start(src_proc=p1)
  stm_on(src_proc=p1, work_proc=p2, function=f, ...)
  stm_commit(src_proc=p1)   (single commit message)

On p2 (local ops):
  stm_read(src_proc=p1)
  stm_write(src_proc=p1)
Transaction Inside Remote Work

on p2 { atomic { f(...); } }

On p1:
  stm_on(src_proc=p1, work_proc=p2, function=f, ...)

On p2 (all local ops):
  stm_start(src_proc=p1)
  stm_read(src_proc=p1)
  stm_write(src_proc=p1)
  stm_commit(src_proc=p1)
Outline
Interface Design
Algorithm Design
Evaluation
• Cluster STM vs. Manual Locking
• Read Locks vs. Read Validation
Conclusion
Algorithm Design
Shared metadata
Transaction-local metadata
STM design choices
Shared Metadata

Some metadata must be visible to all transactions
• Read and write locks
• Validation ID

Memory overhead compromise
• Each metadata word guards one conflict detection unit (CDU)
• CDU size is s ≥ 1 words
• s > 1 may introduce false sharing

Keep metadata on same processor as corresponding CDU
Transaction-Local Metadata

On p1: stm_on(src_proc=p1), stm_start(src_proc=p1), stm_commit(src_proc=p1)
On p2: stm_read(src_proc=p1), stm_write(src_proc=p1)

• p2 stores tx metadata for p1
• p2 uses stored metadata to commit on behalf of p1

Keep metadata local to processor where access occurred
STM Design Choices

Eight-axis design space (see paper)
Choose four axes to explore
• CDU size
• Read locks (RL) vs. read validation (RV)
• Undo log (UL) vs. write buffer (WB)
• Early acquire (EA) vs. late acquire (LA)

Some tradeoffs come out differently on clusters
Discuss RL vs. RV here
See paper for additional results
Outline
Interface Design
Algorithm Design
Evaluation
• Cluster STM vs. Manual Locking
• Read Locks vs. Read Validation
Conclusion
Evaluation
Benchmarks
• Microbenchmarks: Intset, Hashmap Swap
• Graph clustering: SSCA 2 Kernel 4
Machine
• Intel Xeon cluster
• Two cores per node
• Myrinet
Outline
Interface Design
Algorithm Design
Evaluation
• Cluster STM vs. Manual Locking
• Read Locks vs. Read Validation
Conclusion
Intset Results

[Chart: time in seconds (log scale) vs. number of processors, 1 to 512 (log scale); series: Locks and STM at 6M operations, Locks and STM at 100M operations]

• Switching problem sizes at p = 8
• Bump at p < 4
• Good scaling after p = 4; locks and STM about the same
SSCA 2 Results

[Chart: time in seconds (log scale) vs. number of processors, 1 to 512 (log scale); series: locks and STM]

• Bump is smaller
• Good scaling after p = 4; STM overhead about 2.5x
Outline
Interface Design
Algorithm Design
Evaluation
• Cluster STM vs. Manual Locking
• Read Locks vs. Read Validation
Conclusion
Implementation

Read locks (RL)
• Metadata word holds write bit and reader count
• Abort on attempt to read or write conflicting CDU

Read validation (RV)
• Metadata word holds write bit and validation ID
• Increment ID with each write
• Abort at commit if ID changed after CDU was read

Validation requires extra remote op at commit
No global cache ⇒ no helpful cache effects
Ratio of RV to RL Runtimes

[Chart: ratio of RV to RL runtime (0 to 4) vs. number of processors, 1 to 512; series: Intset and SSCA 2]
Results Summary
Good scalability to p = 512
Good performance
• Nearly identical to locks on microbenchmarks
• Some STM overhead for SSCA 2
Different design tradeoffs
• RL outperforms RV
• EA outperforms LA (see paper)
• Minimal penalty for WB (see paper)
Outline
Interface Design
Algorithm Design
Evaluation
Conclusion
Conclusion
Presented design and evaluation of Cluster STM
• First STM for high performance on large scale clusters
• Good performance, scaling to p = 512
• New evaluation of design tradeoffs
Future work
• Exploiting shared memory within a node
• Nested parallelism in a transaction
• Dynamic spawning of threads
Thank You