Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood Presented by: Eduardo Cuervo


Page 2

Previous TM systems: abort fast, commit slow
◦ Old values kept “in place”
◦ New values stored somewhere else

Commit is the common case!
◦ Remember Amdahl’s Law

Conflicts are usually resolved by hardware
◦ Fast but myopic
◦ Trap to SW when careful resolution is needed

Page 4

Eager version management
◦ Puts new values in place for faster commits
◦ No data moves, even on cache overflow

Eager conflict detection
◦ Detects an offending load/store immediately
◦ Fast conflict detection on evicted blocks
◦ Fast commit through lazy reset of directory state

Aborts handled in SW
◦ Aborts are much less common than commits

Page 5

Per-thread undo log in cacheable virtual memory
◦ On a store, logs the block’s virtual address and its previous contents

Write (W) bit per cache block
◦ Tracks whether the block has already been stored to (and logged)

Fast commits
◦ Clear the W bits and reset the log pointer

Slower aborts
◦ Must also write the old values from the log back
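The log mechanism above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not LogTM itself: memory is a dict of 64-byte blocks, the per-thread log is a Python list, the W bits are a set, and the class and method names are hypothetical. In LogTM the abort path runs in a software handler rather than as a method call.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed)


class UndoLogTx:
    """Hypothetical model of one thread's transaction with an undo log."""

    def __init__(self, memory):
        self.memory = memory   # block address -> bytes; new values go here, in place
        self.log = []          # (block address, old contents), in program order
        self.w_bits = set()    # blocks already stored to (and logged)

    def store(self, addr, value):
        block = addr - addr % BLOCK_SIZE
        if block not in self.w_bits:
            # First store to this block: save the old contents, set the W bit.
            self.log.append((block, self.memory[block]))
            self.w_bits.add(block)
        data = bytearray(self.memory[block])
        data[addr % BLOCK_SIZE] = value
        self.memory[block] = bytes(data)

    def commit(self):
        # Fast path: no data moves; clear W bits and reset the log.
        self.w_bits.clear()
        self.log.clear()

    def abort(self):
        # Slow path: walk the log newest-first and write old values back.
        for block, old in reversed(self.log):
            self.memory[block] = old
        self.w_bits.clear()
        self.log.clear()
```

Because the W bit suppresses repeat logging, two stores to the same block produce a single log entry, and commit touches no data at all.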

Page 6

Virtual Address | Data Block | R W
00 | 12 - - - - - - - | 0 0
40 | - - - - - - - 23 | 0 0
c0 | 34 - - - - - - - | 0 0

LogBase = 1000, LogPtr = 1000, TMcount = 1
Log (memory at 1000, 1040, 1080): empty

Page 7

Block 00 is read: R bit set

Virtual Address | Data Block | R W
00 | 12 - - - - - - - | 1 0
40 | - - - - - - - 23 | 0 0
c0 | 34 - - - - - - - | 0 0

LogBase = 1000, LogPtr = 1000, TMcount = 1
Log: empty

Page 8

Block c0 is written (34 → 56): old contents logged, W bit set

Virtual Address | Data Block | R W
00 | 12 - - - - - - - | 1 0
40 | - - - - - - - 23 | 0 0
c0 | 56 - - - - - - - | 0 1

LogBase = 1000, LogPtr = 1048, TMcount = 1
Log entry at 1000: c0 | 34 - - - - - - -

Page 9

Block 40 is read and written (23 → 24): old contents logged, R and W bits set

Virtual Address | Data Block | R W
00 | 12 - - - - - - - | 1 0
40 | - - - - - - - 24 | 1 1
c0 | 56 - - - - - - - | 0 1

LogBase = 1000, LogPtr = 1090, TMcount = 1
Log entry at 1000: c0 | 34 - - - - - - -
Log entry at 1048: 40 | - - - - - - - 23

Page 10

Commit: R/W bits cleared, LogPtr reset, TMcount = 0; new values stay in place

Virtual Address | Data Block | R W
00 | 12 - - - - - - - | 0 0
40 | - - - - - - - 24 | 0 0
c0 | 56 - - - - - - - | 0 0

LogBase = 1000, LogPtr = 1000, TMcount = 0
Log: old entries (c0, 40) are left in memory but no longer needed

Page 11

Abort: old values restored from the log, R/W bits cleared, LogPtr reset, TMcount = 0

Virtual Address | Data Block | R W
00 | 12 - - - - - - - | 0 0
40 | - - - - - - - 23 | 0 0
c0 | 34 - - - - - - - | 0 0

LogBase = 1000, LogPtr = 1000, TMcount = 0
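The state changes on Pages 6-11 can be replayed with a short sketch. It rests on assumptions not stated on the slides: the addresses are hexadecimal, each log entry holds an 8-byte virtual address plus a 64-byte block (72 = 0x48 bytes, which matches LogPtr moving 1000 → 1048 → 1090), and the accesses are "read 00, write 56 to c0, read 40 and increment it". Only the one non-empty word of each block is modeled, and `replay` is a hypothetical helper.

```python
ENTRY_BYTES = 8 + 64   # assumed: 8-byte virtual address + 64-byte block = 0x48
LOG_BASE = 0x1000


def replay(ending):
    """Replay Pages 6-11; `ending` is "commit" (Page 10) or "abort" (Page 11)."""
    data = {0x00: 12, 0x40: 23, 0xC0: 34}   # the one non-empty word per block
    r_bits = dict.fromkeys(data, 0)
    w_bits = dict.fromkeys(data, 0)
    log = []                                # (block, old value), oldest first
    log_ptr = LOG_BASE
    trace = [log_ptr]                       # Page 6: transaction has begun

    def load(blk):
        r_bits[blk] = 1

    def store(blk, value):
        nonlocal log_ptr
        if not w_bits[blk]:                 # first store: log old contents
            log.append((blk, data[blk]))
            log_ptr += ENTRY_BYTES
            w_bits[blk] = 1
        data[blk] = value

    load(0x00)
    trace.append(log_ptr)                   # Page 7
    store(0xC0, 56)
    trace.append(log_ptr)                   # Page 8
    load(0x40)
    store(0x40, data[0x40] + 1)             # 23 -> 24
    trace.append(log_ptr)                   # Page 9

    if ending == "abort":                   # Page 11: undo, newest entry first
        for blk, old in reversed(log):
            data[blk] = old
    for blk in data:                        # both endings: clear R/W bits
        r_bits[blk] = w_bits[blk] = 0
    log_ptr = LOG_BASE                      # ... and reset LogPtr
    trace.append(log_ptr)
    return data, trace
```

`replay("commit")` leaves 12, 24 and 56 in place, while `replay("abort")` restores 12, 23 and 34, matching Pages 10 and 11.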

Page 12

Coherence requests are sent to the directory

The directory forwards them to the other processor(s)

Processors detect conflicts
◦ Using local state (R/W bits)
◦ Responding with Ack/Nack
◦ The requester resolves any conflict

Adds a read (R) bit to each cache block

Extends the MOESI protocol
◦ “Sticky” states
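The responder's local check can be sketched as a small function. It assumes the usual read/write conflict rule: a shared request (GETS) conflicts only with a block the responder has written, while an exclusive request (GETX) conflicts with a block it has read or written. The names `Request` and `check_conflict` are hypothetical.

```python
from enum import Enum


class Request(Enum):
    GETS = "GETS"   # asks for a shared (readable) copy
    GETX = "GETX"   # asks for an exclusive (writable) copy


def check_conflict(request: Request, r_bit: bool, w_bit: bool) -> str:
    """Responder's answer to a forwarded request, using only local R/W bits."""
    if request is Request.GETS:
        # Another reader conflicts only with our uncommitted writes.
        conflict = w_bit
    else:
        # A writer conflicts with anything we have read or written.
        conflict = r_bit or w_bit
    return "NACK" if conflict else "ACK"
```

Because the answer depends only on bits the responder already holds, no global state is consulted; the requester decides what to do with the NACK.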

Page 13

Works even after cache overflow
◦ The directory forwards potentially conflicting requests to “interested” processors

Adds a per-processor overflow bit
◦ The transactional block can be written back, updating memory
◦ Requests are still redirected to the processor
◦ The processor can Nack on a conflict

Page 14

Eviction of a transactional block depends on its MOESI state

M: Replace with a transactional writeback
◦ Directory sets the state to “Sticky@Processor”
◦ Requests are forwarded to the processor

S: Silently replaced
◦ Processor stays in the sharer list
◦ Requests are forwarded to all sharers

O: Write back to the directory
◦ Processor adds itself to the sharer list; exclusive requests are then handled as in S

E: Same as O
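A directory-side sketch of the eviction cases above follows. It is a simplified model under assumptions: a directory entry is a tuple (state, sticky owner, sharers), O and E evictions end up handled like S as the slide says, and `evict` and `forward_targets` are hypothetical names.

```python
from typing import FrozenSet, Optional, Set, Tuple

DirEntry = Tuple[str, Optional[str], FrozenSet[str]]  # (state, sticky owner, sharers)


def evict(state: str, p: str, sharers: FrozenSet[str]) -> DirEntry:
    """Directory entry after processor `p` evicts a transactional block in `state`."""
    if state == "M":
        # Transactional writeback: memory is updated, but the directory stays
        # "sticky" with p as owner so requests keep reaching p.
        return ("M", p, frozenset())
    if state in ("S", "O", "E"):
        # S: silent replacement, p simply stays listed as a sharer.
        # O/E: written back, p (re)listed as a sharer, then treated like S.
        return ("S", None, sharers | {p})
    raise ValueError(f"no transactional eviction from state {state!r}")


def forward_targets(entry: DirEntry, requester: str, exclusive: bool) -> Set[str]:
    """Processors a new request is forwarded to, so they can check for conflicts."""
    state, owner, sharers = entry
    if state == "M" and owner is not None:
        return {owner} - {requester}
    if state == "S" and exclusive:
        return set(sharers) - {requester}
    return set()   # a shared request to a shared block cannot conflict
```

The key design point is that eviction never removes the evicting processor from the directory's view, so later conflicting requests still reach it.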

Page 15

(Cache entries read as: state (R W) [data version])

Directory: Idle [old]
P: TMcount = 1, Overflow = 0; I (- -) [none]

Page 16

Messages: GETX, DATA, ACK

Directory: M@P [old]
P: TMcount = 1, Overflow = 0; M (R W) [new]

Page 17

Q requests the block (GETS); the directory forwards it to P (Fwd_GETS), and P responds with NACK

Directory: M@P [old]
P: TMcount = 1, Overflow = 0; M (R W) [new]
Q: TMcount = 1, Overflow = 0; I (- -) [ ]

Page 18

P evicts the transactional block (messages: PUTX, NACK, WB_XACT); the directory stays M@P, now holding the new value

Directory: M@P [new]
P: TMcount = 1, Overflow = 1; I (- -) [ ]

Page 19

Q’s GETS is again forwarded to P (Fwd_GETS); P no longer holds the block, but its overflow bit is set, so it responds with NACK

Directory: M@P [new]
P: TMcount = 1, Overflow = 1; I (- -) [ ]
Q: TMcount = 1, Overflow = 0; I (- -) [ ]

Page 20

P’s transaction has finished (TMcount = 0, Overflow = 0). Q’s GETS is forwarded to P (Fwd_GETS), P responds with ACK, Q receives DATA, and the sticky state is cleaned up (CLEAN)

Directory: E@Q [new]
P: TMcount = 0, Overflow = 0; I (- -) [ ]
Q: TMcount = 1, Overflow = 0; E (R -) [new]
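The processor side of the Pages 15-20 walkthrough can be sketched as follows. It is a simplified model under assumptions: only forwarded shared requests (Fwd_GETS) are handled, the directory and its lazy CLEAN step are not modeled, and `Proc` and `replay` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Proc:
    """Hypothetical model of one processor's transactional state."""
    tm_count: int = 0                 # > 0 while inside a transaction
    overflow: bool = False            # set when a transactional block is evicted
    bits: Dict[str, Tuple[bool, bool]] = field(default_factory=dict)  # block -> (R, W)

    def evict(self, block: str) -> None:
        r, w = self.bits.pop(block)
        if self.tm_count > 0 and (r or w):
            self.overflow = True

    def commit(self) -> None:
        self.tm_count = 0
        self.overflow = False
        self.bits = {blk: (False, False) for blk in self.bits}

    def on_fwd_gets(self, block: str) -> str:
        """Answer a forwarded shared request (Fwd_GETS)."""
        if block in self.bits:
            return "NACK" if self.bits[block][1] else "ACK"
        # Not cached: if transactional data may have overflowed, NACK conservatively.
        return "NACK" if self.tm_count > 0 and self.overflow else "ACK"


def replay() -> List[str]:
    p = Proc(tm_count=1)
    p.bits["blk"] = (True, True)            # Page 16: P holds the block, M (R W)
    answers = [p.on_fwd_gets("blk")]        # Page 17: W bit set -> NACK
    p.evict("blk")                          # Page 18: transactional writeback
    answers.append(p.on_fwd_gets("blk"))    # Page 19: overflow bit -> NACK
    p.commit()                              # P's transaction ends
    answers.append(p.on_fwd_gets("blk"))    # Page 20: ACK
    return answers
```

The single overflow bit is coarse: once set, P NACKs every forwarded request for any uncached block until its transaction ends.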

Page 21

Lazy cleanup is better if overflow is rare
◦ Can be improved otherwise (e.g., using Bloom filters)

Ambiguities are handled conservatively
◦ On a refetch, the processor cannot tell whether the block was accessed by the current transaction or an earlier one
◦ Sets the R and W bits
◦ Logs the old values
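One way the Bloom-filter idea could look is sketched below, as an assumption rather than the talk's design: instead of a single overflow bit, a processor records evicted transactional block addresses in a small Bloom filter and NACKs only blocks that might be in it. False positives cause unnecessary NACKs; false negatives cannot occur. The class name, sizes, and hash choice are hypothetical.

```python
import hashlib


class OverflowSignature:
    """Hypothetical Bloom-filter signature of evicted transactional blocks."""

    def __init__(self, nbits: int = 1024, nhashes: int = 3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0          # bit vector stored in a Python int

    def _positions(self, block: int):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{block}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.nbits

    def add(self, block: int) -> None:
        # Called when a transactional block is evicted.
        for pos in self._positions(block):
            self.bits |= 1 << pos

    def may_contain(self, block: int) -> bool:
        # True means "possibly overflowed": NACK conservatively.
        return all((self.bits >> pos) & 1 for pos in self._positions(block))

    def clear(self) -> None:
        # e.g. when the transaction commits or aborts
        self.bits = 0
```

Compared with one overflow bit, the filter trades a little state for fewer spurious NACKs when overflow is common.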

Page 23

When two transactions conflict
◦ At least one must stall or abort
◦ Quick but myopic decision by HW
◦ Slow but careful decision by SW

Hybrid approach:
◦ HW seeks a fast solution and traps to SW if the problem persists

Page 24

Distributed timestamps

Trap to a conflict handler (SW) when
◦ The transaction could cause a deadlock
◦ It is logically later than the transaction it conflicts with

Per-processor possible-cycle flag
◦ Set when the processor nacks a logically earlier transaction
◦ A conflict is declared if a nack is received from a logically earlier transaction while the flag is set
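The timestamp and possible-cycle rules can be sketched as follows. Assumptions: timestamps are unique (a real design would need a tie-breaker such as a processor ID), "trap" stands for invoking the software conflict handler, and `Txn` and its methods are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    """Hypothetical per-processor conflict-resolution state."""
    timestamp: int                 # smaller = logically earlier (assumed unique)
    possible_cycle: bool = False

    def on_send_nack(self, requester: "Txn") -> None:
        # Stalling a logically earlier transaction may close a wait-for cycle.
        if requester.timestamp < self.timestamp:
            self.possible_cycle = True

    def on_receive_nack(self, responder: "Txn") -> str:
        # Waiting on an earlier transaction while possibly in a cycle: give up.
        if responder.timestamp < self.timestamp and self.possible_cycle:
            return "trap"          # software conflict handler, e.g. abort
        return "stall"             # hardware stalls and retries the request
```

In a two-transaction deadlock, the logically earlier transaction keeps stalling while the later one traps, so the cycle is broken without aborting both.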

Page 25

Target system
◦ SPARC/Solaris, 32 processors at 1 GHz
◦ L1: 16 KB, 4-way, split, 1-cycle latency
◦ L2: 4 MB, 4-way, unified, 12-cycle latency
◦ Memory: 4 GB, 80-cycle latency
◦ Directory: full-bit-vector sharer list, migratory sharing optimization, directory cache, 6-cycle latency
◦ Interconnection: hierarchical switch topology, 14-cycle link latency

Simulated using Simics
◦ LogTM interface added via “magic” instructions

Page 26

Shared-counter microbenchmark

Compared to
◦ Exponential backoff
◦ MCS locks

LogTM outperforms both

LogTM does not abort transactions

Page 27

Evaluated using a subset of SPLASH-2

Used two versions of raytrace (with and without false sharing)

False sharing has a significant impact!

Performance gains range from moderate to large

Page 28

LogTM must read a block before writing it to the log
◦ Benchmarks showed that the data is usually read anyway

LogTM is more sensitive to false sharing than lock-based approaches

The log only needs to be valid if the transaction aborts
◦ A k-block log write buffer therefore avoids most log writes, as the benchmarks show
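The k-block log write buffer can be sketched as below. Assumptions: entries are written to the in-memory log only when the buffer overflows, a commit simply drops the buffered entries, and an abort handler can read both the buffer and the in-memory log. The class and its methods are hypothetical.

```python
from collections import deque


class LogWriteBuffer:
    """Hypothetical k-entry buffer in front of the in-memory undo log."""

    def __init__(self, k: int):
        if k < 1:
            raise ValueError("buffer needs at least one entry")
        self.k = k
        self.buffer = deque()    # newest entries, not yet written to memory
        self.memory_log = []     # entries actually written to the log
        self.log_writes = 0      # number of memory log writes performed

    def append(self, entry) -> None:
        if len(self.buffer) == self.k:
            # Buffer full: spill the oldest entry to the in-memory log.
            self.memory_log.append(self.buffer.popleft())
            self.log_writes += 1
        self.buffer.append(entry)

    def commit(self) -> None:
        # The log is never read after a commit, so buffered entries are dropped.
        self.buffer.clear()
        self.memory_log.clear()

    def entries_for_abort(self) -> list:
        """All undo entries, newest first, as an abort handler would replay them."""
        return list(reversed(self.memory_log + list(self.buffer)))
```

A transaction that logs at most k blocks and then commits never writes its log to memory, which is why a small buffer can absorb most log traffic.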

Page 29

TCC
◦ Lazy version management (slow commits)
◦ Lazy conflict detection (detects conflicts at commit)

LTM
◦ On overflow, stores new values in an uncacheable in-memory hash table
◦ LogTM, in contrast, allows both old and new versions to be cached

Page 30

UTM
◦ Logs blocks targeted by both loads and stores
◦ More complete conflict detection
◦ Must walk the log on certain coherence requests

VTM
◦ Per-address-space virtual mode for cache evictions, paging, and context switches
◦ In virtualized mode, VTM uses microcode for conflict detection (LogTM uses a MOESI extension)

Page 31

Presents a TM implementation designed to speed up the common case (commit)

Efficiently handles cache evictions

Requires simple architectural changes
◦ Registers, state bits, directory extension

Works towards hybrid conflict resolution

No paging or context-switch support

Very sensitive to false sharing