A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia...

A Scalable Approach to Thread-Level Speculation

J. Gregory Steffan, Christopher B. Colohan,

Antonia Zhai, and Todd C. Mowry

Carnegie Mellon University

Outline

Motivation
Thread-level speculation (TLS)
Coherence scheme
Optimizations
Methodology
Results
Conclusion

Motivation

Leading chip manufacturers are moving to multi-core architectures.
The extra cores are usually used to increase throughput.
To exploit these parallel resources for performance, programs need to be parallelized.
Integer programs are hard to parallelize.
Use speculation: thread-level speculation (TLS)!

Thread level speculation (TLS)

Scalable Approach

The paper aims to design a scalable approach that applies to a wide variety of multiprocessor-like architectures.
The only limitation is that the architecture must be shared-memory based.
TLS is implemented on top of an invalidation-based cache coherence protocol.

Example

Each cache line has special bits:
  SL: a speculative load has accessed the line
  SM: the line is speculatively modified
A thread is squashed on an incoming invalidation if:
  the line is present
  SL is set
  the epoch number indicates an earlier thread
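The squash condition above can be sketched as a single predicate. This is a minimal illustration, not the paper's hardware logic: the names CacheLine and is_violation are hypothetical, and epochs are assumed to be plain integers where a smaller value means a logically earlier thread.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    present: bool = False  # line is held in the local cache
    sl: bool = False       # SL: a speculative load has accessed the line
    sm: bool = False       # SM: the line is speculatively modified


def is_violation(line: CacheLine, local_epoch: int, sender_epoch: int) -> bool:
    """Return True if an incoming invalidation should squash the local thread.

    Assumption: epochs are integers and a smaller number is logically earlier.
    """
    return line.present and line.sl and sender_epoch < local_epoch
```

All three conditions must hold: an invalidation from a logically later thread, or one that hits a line that was never speculatively loaded, does not cause a squash in this sketch.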

Speculation level

We are concerned only with the speculation level: the level in the cache hierarchy where the cache protocol begins.
We can ignore all the other levels.

Cache line states

Apart from the cache state bits, we need SL and SM bits.
A cache line with speculative bits set cannot be replaced.
If such a line would have to be replaced, the thread is either squashed or the operation is delayed.

Basic cache coherence protocol

When a processor wants to load a value, it needs at least shared access to the line.
When it wants to write, it needs exclusive access.
The coherence mechanism issues an invalidation message when it receives a request for exclusive access.

Coherence mechanism

Commit

When the homefree token arrives, there is no possibility of further squashes.
SpE is changed to E and SpS to S.
Lines with the SM bit set must have the D bit set.
If a line is speculatively modified and shared, we have to get exclusive access for that line.
The ownership required buffer (ORB) is used to track such lines.
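The commit steps above can be sketched in software as follows. This is a simplified model under stated assumptions: the state names "I", "S", "E", "D", "SpS", "SpE" are stored as strings, the D bit is modeled as the state "D", and Line, commit, and the get_exclusive callback are hypothetical names, not part of the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Line:
    state: str        # assumed encoding: "I", "S", "E", "D", "SpS", "SpE"
    sl: bool = False  # speculatively loaded
    sm: bool = False  # speculatively modified


def commit(cache: dict[int, Line], orb: list[int],
           get_exclusive: Callable[[int], None]) -> None:
    """Commit speculative state once the homefree token has arrived."""
    # Lines tracked by the ORB are speculatively modified but only shared,
    # so ownership has to be requested before they can become dirty.
    for addr in orb:
        get_exclusive(addr)
    orb.clear()

    for line in cache.values():
        if line.sm:
            line.state = "D"      # speculative modifications become dirty
        elif line.state == "SpE":
            line.state = "E"
        elif line.state == "SpS":
            line.state = "S"
        line.sl = line.sm = False  # the epoch is no longer speculative
```

In this sketch the only lines that need coherence traffic at commit time are those recorded in the ORB; every other transition is local.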

Squash

All speculatively modified lines have to be invalidated.
SpE is changed to E and SpS to S.
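The squash transitions can be sketched in the same style. As before, this is an illustrative model only: the string state names and the Line and squash identifiers are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    state: str        # assumed encoding: "I", "S", "E", "D", "SpS", "SpE"
    sl: bool = False  # speculatively loaded
    sm: bool = False  # speculatively modified


def squash(cache: dict[int, Line]) -> None:
    """Discard the speculative state of a squashed thread."""
    for line in cache.values():
        if line.sm:
            line.state = "I"      # speculative writes must not become visible
        elif line.state == "SpE":
            line.state = "E"      # data was only read, so it is still valid
        elif line.state == "SpS":
            line.state = "S"
        line.sl = line.sm = False
```

Lines that were only speculatively loaded keep their data, so a restarted thread can hit on them again in this model.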

Performance Optimizations

Forwarding data between epochs: predictable data dependences are synchronized.
Dirty and speculatively loaded state: usually, if a dirty line is speculatively loaded, it is flushed; this can be avoided.
Suspending violations: when we have to evict a speculative line, we don't need to squash.

Multiple writers

If two epochs write to the same line, we have to squash one to avoid the multiple-writer problem.
It is possible to avoid this by maintaining fine-grained disambiguation bits.
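One way to picture fine-grained disambiguation bits is a per-word "speculatively modified" mask for each line. The sketch below is an assumption-laden illustration, not the paper's design: the word granularity, WORDS_PER_LINE, and the functions spec_store and merge_on_commit are all hypothetical.

```python
WORDS_PER_LINE = 8  # assumed line size in words, for illustration only


def spec_store(line_data: list[int], sm_mask: int, word: int, value: int) -> int:
    """Write one word speculatively and return the updated per-word SM mask."""
    line_data[word] = value
    return sm_mask | (1 << word)


def merge_on_commit(committed: list[int], spec_line: list[int], sm_mask: int) -> list[int]:
    """Merge only the words this epoch actually modified into the committed line."""
    return [
        spec_line[i] if (sm_mask >> i) & 1 else committed[i]
        for i in range(len(committed))
    ]
```

With one bit per word, two epochs that write different words of the same line can both commit, in epoch order, without either being squashed:

```python
memory = [0] * WORDS_PER_LINE
a = list(memory)
mask_a = spec_store(a, 0, 1, 11)   # earlier epoch writes word 1
b = list(memory)
mask_b = spec_store(b, 0, 5, 55)   # later epoch writes word 5
merged = merge_on_commit(merge_on_commit(memory, a, mask_a), b, mask_b)
```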

Implementation

Epoch numbers

An epoch number has two parts: a TID and a sequence number.
To avoid costly comparisons on every access, the difference is precomputed and a logically-later mask is formed.
Epoch numbers are maintained in one place for each chip.
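The idea of precomputing a logically-later mask can be sketched as below. This is a hedged illustration: Epoch, is_later, and logically_later_mask are hypothetical names, epochs with different TIDs are assumed to be unordered, and sequence-number wraparound is ignored.

```python
from typing import NamedTuple


class Epoch(NamedTuple):
    tid: int  # thread identifier
    seq: int  # sequence number within that thread


def is_later(a: Epoch, b: Epoch) -> bool:
    """True if a is logically later than b.

    Assumption: only epochs with the same TID are ordered, by sequence number.
    """
    return a.tid == b.tid and a.seq > b.seq


def logically_later_mask(contexts: list[Epoch], me: int) -> int:
    """Precompute one bit per on-chip context that is logically later than `me`."""
    mask = 0
    for i, epoch in enumerate(contexts):
        if i != me and is_later(epoch, contexts[me]):
            mask |= 1 << i
    return mask
```

The mask only needs to be recomputed when an epoch number changes; each memory access can then test a single bit instead of comparing full epoch numbers.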

Speculative state implementation

Multiple writers - implementation

False violations are also handled in the same

Correctness considerations

Speculation fails if the speculative state is lost.
Exceptions are handled only once the homefree token has been received.
System calls are also postponed.

Methodology

Detailed out-of-order simulation based on the MIPS R10000.
Fork and other synchronization overheads are 10 cycles.

Results Normalized execution cycles

Results

Buk and equake: memory performance is a bottleneck.
With more than 4 processors, ijpeg performance degrades:
  fewer threads are available
  some conflicts in the cache

Overheads

Violations
Cache locality is important
ORB size can be further reduced: early release of
Communication overhead
Buk is insensitive

Multiprocessor performance

Advantage: more cache storage
Disadvantage: increased communication latency

Conclusion

By using TLS, even integer programs can be parallelized to get speedup.
The approach is scalable and can be applied to various other architectures that support multiple threads.
Some applications are insensitive to communication latency, so large-scale parallel architectures using TLS are possible.

Thanks!
