Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood
Presented by: Eduardo Cuervo
Previous TM systems abort fast, commit slow
◦ Old values “in place”
◦ New values somewhere else
Commit is the common case!
◦ Remember Amdahl’s Law
Conflicts usually solved by hardware
◦ Fast but myopic
◦ Trapping to SW if needed for careful resolution
Eager version management
◦ Puts new values in place for faster commits
◦ No data moves even on cache overflow
Eager conflict detection
◦ Detects offending ld/st immediately
◦ Fast conflict detection on evicted blocks
◦ Fast commit by lazy reset of directory state
Handle aborts by SW
◦ Aborts are much less common than commits
Per-thread log in cacheable virtual memory
◦ On a store, logs the address and previous contents of the block
Write (W) bit
◦ Tracks whether a block has been stored to and logged
Faster commits
◦ Clear W bits and reset log (pointer)
Slower aborts
◦ Also have to write old values back
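The log and W-bit mechanism above can be sketched in software. This is only a toy model under stated assumptions, not the hardware design: the class name `LogTMThread`, the dict-based `memory`, and the method names are all hypothetical, and blocks are modeled as short Python lists.

```python
class LogTMThread:
    """Toy model of eager version management (illustrative names only)."""

    def __init__(self, memory):
        self.memory = memory   # block address -> list of words; new values live here, in place
        self.log = []          # per-thread undo log of (address, old block contents)
        self.w_bits = set()    # block addresses whose W bit is set

    def store(self, addr, index, value):
        # Log the old block only on the first store to it (W bit not yet set).
        if addr not in self.w_bits:
            self.log.append((addr, list(self.memory[addr])))
            self.w_bits.add(addr)
        self.memory[addr][index] = value   # the new value goes in place

    def commit(self):
        # Fast path: clear the W bits and reset the log; no data moves.
        self.w_bits.clear()
        self.log.clear()

    def abort(self):
        # Slow path: walk the log backwards and write old values back.
        while self.log:
            addr, old_block = self.log.pop()
            self.memory[addr] = old_block
        self.w_bits.clear()


memory = {0x00: [1, 2], 0x40: [2, 3], 0xC0: [3, 4]}
thread = LogTMThread(memory)
thread.store(0xC0, 0, 5)
thread.store(0xC0, 1, 6)   # second store to the same block is not logged again
thread.abort()             # block 0xC0 is restored from the log
```

Commit only discards bookkeeping, while abort copies data, which is the trade-off the slides argue for when commits are the common case.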
[Figure: eager version management example, shown as a sequence of animation frames. Each frame shows a table with columns Virtual Address (00, 40, c0), Data Block, R and W, together with the LogBase and LogPtr registers and the log contents at addresses 1000, 1040 and 1080.]
Coherence requests sent to directory
Directory forwards to other processor(s)
Processors detect conflicts
◦ Using local state
◦ Ack/Nack as response
◦ Requester resolves any conflict
Adds a read (R) bit to each cache block
Extends MOESI protocol
◦ “Sticky” states
Works even after cache overflow
◦ Forwards conflicting requests to “interested” processors
Adds a per-processor overflow bit
◦ The transactional block can be updated
◦ Requests will still be redirected to the processor
◦ Processor can Nack on conflict
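A rough sketch of how a processor might answer a forwarded request using its local R/W bits and overflow bit. The class and method names are hypothetical, and the rule "Nack everything once the overflow bit is set" is a conservative assumption made for this sketch; in the real protocol the directory only forwards requests for blocks it considers the processor interested in.

```python
class Processor:
    """Toy model of local conflict detection (illustrative names only)."""

    def __init__(self, name):
        self.name = name
        self.r_bits = set()     # blocks read in the current transaction
        self.w_bits = set()     # blocks written in the current transaction
        self.overflow = False   # set once a transactional block is evicted

    def tx_load(self, addr):
        self.r_bits.add(addr)

    def tx_store(self, addr):
        self.w_bits.add(addr)

    def evict(self, addr):
        # Evicting a transactional block loses its R/W bits, so remember
        # that this transaction has overflowed the cache.
        if addr in self.r_bits or addr in self.w_bits:
            self.r_bits.discard(addr)
            self.w_bits.discard(addr)
            self.overflow = True

    def handle_forwarded(self, request, addr):
        """Answer a forwarded 'GETS' (read) or 'GETX' (write) request."""
        if addr in self.w_bits:
            return "NACK"       # any access conflicts with our write
        if request == "GETX" and addr in self.r_bits:
            return "NACK"       # a write conflicts with our read
        if self.overflow:
            return "NACK"       # assumption: bits were lost, so answer conservatively
        return "ACK"

    def commit(self):
        self.r_bits.clear()
        self.w_bits.clear()
        self.overflow = False


p = Processor("P")
p.tx_store(0x40)
print(p.handle_forwarded("GETS", 0x40))   # NACK: P wrote the block
p.evict(0x40)
print(p.handle_forwarded("GETS", 0x40))   # still NACK, because of the overflow bit
```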
Replacement of a transactional block depends on its MOESI state
M: Replace with transactional writeback
◦ Sets state as “Sticky@Processor”
◦ Requests are forwarded to the processor
S: Silently replaced
◦ Adds processor to sharer list
◦ Requests forwarded to all sharers
O: Write back to directory
◦ Adds itself to sharer list, same as S if requested exclusively
E: Same as O
[Figure: directory protocol example, shown as a sequence of animation frames. Each frame shows the directory state for one block (Idle, M@P, E@Q), processors P and Q with their TMcount and Overflow values and cache state with R/W bits, and the messages exchanged: GETX, DATA, ACK, GETS, Fwd_GETS, NACK, PUTX, WB_XACT, CLEAN.]
Lazy clean up is better if overflow is rare
◦ Can be improved otherwise (e.g. using Bloom filters)
Ambiguities handled conservatively
◦ Refetch during same against earlier transaction
◦ Set R&W bits
◦ Log old values
When two transactions conflict
◦ At least one must stall or abort
◦ Quick myopic decision by HW
◦ Slow and careful by SW
Hybrid approach:
◦ HW seeks fast solution, traps to software if problem persists
Distributed timestamps
Trap to conflict handler (SW)
◦ Transaction could cause deadlock
◦ Logically later than transaction in conflict
Per-processor possible cycle flag
◦ Conflict if Nack received from a logically earlier transaction with possible cycle flag set
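The timestamp and possible-cycle rule can be sketched as follows. The slides state only when a conflict is signaled; the rule used here for setting the flag (set it when sending a Nack to a logically earlier transaction) is an assumption added for this sketch, and the names `Txn`, `send_nack` and `receive_nack` are hypothetical.

```python
class Txn:
    """Toy model of timestamp-based conflict resolution (illustrative names only)."""

    def __init__(self, name, timestamp):
        self.name = name
        self.timestamp = timestamp    # smaller value means logically earlier
        self.possible_cycle = False

    def send_nack(self, requester):
        # Assumption: stalling a logically earlier transaction is what
        # could close a waits-for cycle, so record that possibility.
        if requester.timestamp < self.timestamp:
            self.possible_cycle = True
        return self.timestamp         # the Nack carries the sender's timestamp

    def receive_nack(self, nacker_timestamp):
        """Return 'trap' to invoke the software conflict handler, else 'stall'."""
        if nacker_timestamp < self.timestamp and self.possible_cycle:
            return "trap"
        return "stall"


older = Txn("A", timestamp=1)
younger = Txn("B", timestamp=2)

# B Nacks a request from the earlier transaction A: A just stalls and retries.
print(older.receive_nack(younger.send_nack(older)))     # stall
# B is then Nacked by A while B's flag is set: B traps to software.
print(younger.receive_nack(older.send_nack(younger)))   # trap
```

Only the logically later transaction can end up trapping in this sketch, so the earlier one keeps making progress.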
Target System
◦ SPARC Solaris, 32 processors, 1 GHz
◦ L1: 16 KB 4-way split, 1-cycle latency
◦ L2: 4 MB 4-way unified, 12-cycle latency
◦ Memory: 4 GB, 80-cycle latency
◦ Directory: full bit-vector sharer list, migratory sharing optimization, directory cache, 6-cycle latency
◦ Interconnection: hierarchical switch topology, 14-cycle link latency
Simulated using Simics
◦ LogTM interface added by “magic” instructions
Shared counter micro-benchmark
Compared to
◦ Exponential Backoff
◦ MCS locks
LogTM outperforms them
LogTM does not abort transactions
Evaluated using a subset of SPLASH-2
Used two versions of raytrace (with/without false sharing)
False sharing has significant impact!
Performance gains range from moderate to large
LogTM must read a block before writing it to the log
◦ Benchmarks showed that data is usually read anyway
LogTM is more sensitive to false sharing than lock approaches
Since the log is required to be valid only until an abort
◦ A k-block log write buffer eliminates most log writes, as shown in the benchmarks
TCC
◦ Lazy version management (slow commits)
◦ Lazy conflict detection (detect on commit)
LTM
◦ On overflow stores new values in uncacheable in-memory hash table
◦ LogTM allows both old and new versions to be cached
UTM
◦ Logs blocks targeted by both loads and stores
◦ More complete conflict detection
◦ Must walk log on certain coherence requests
VTM
◦ Per address space virtual mode for cache evictions, paging, context switches
◦ Virtualized VTM uses micro-code for conflict detection (LogTM uses MOESI extension)
Presents a TM implementation designed to speed up the common case
Efficiently handles cache evictions
Requires simple architectural changes
◦ Registers, state, directory extension
Work towards hybrid conflict detection
No paging or context switch support
Very sensitive to false sharing