Thread-Level Speculation
Karan Singh, CS 612
2.23.2006
2.23.2006 CS 612 2
Introduction
extraction of parallelism at compile time is limited
Thread-Level Speculation (TLS) is a form of optimistic parallelization
TLS allows automatic parallelization by supporting thread execution without advance knowledge of any dependence violations
Introduction
Zhang et al. – extensions to the cache coherence protocol hardware to detect dependence violations
Pickett et al. – design for a Java-specific software TLS system that operates at the bytecode level
Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors
Ye Zhang, Lawrence Rauchwerger, Josep Torrellas
Outline
Loop parallelization basics
Speculative Run-Time Parallelization in Software
Speculative Run-Time Parallelization in Hardware
Evaluation and Comparison
Loop parallelization basics
a loop can be executed in parallel without synchronization only if the outcome is independent of the order of iterations
need to analyze data dependences across iterations: flow, anti, output
if no dependences – doall loop
if only anti or output dependences – privatization, scalar expansion …
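As a concrete illustration (my own hypothetical loops, not from the paper), the three cross-iteration dependence types look like this:

```python
N = 8

# flow (true) dependence: iteration i reads what iteration i-1 wrote;
# this ordering constraint cannot be removed, so the loop is not a doall
a = list(range(N + 1))
for i in range(1, N + 1):
    a[i] = a[i - 1] + 1

# anti dependence: iteration i reads b[i+1] before iteration i+1
# overwrites it; removable by keeping a private copy of the old values
b = list(range(N + 2))
for i in range(1, N + 1):
    b[i] = b[i + 1] + 1

# output dependence: every iteration writes the same scalar;
# removable by scalar expansion (one t per iteration)
t = 0
for i in range(1, N + 1):
    t = i
```

Speculation targets loops where the compiler cannot prove which, if any, of these patterns actually occurs at run time.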
Loop parallelization basics
to parallelize such a loop speculatively, we need:
a way of saving and restoring state
a method to detect cross-iteration dependences
Speculative Run-TimeParallelization in Software
mechanism for saving/restoring state:
before executing speculatively, we need to save the state of the arrays that will be modified
dense access – save the whole array
sparse access – save individual elements
in all cases, save only the modified shared arrays
if the loop is not found parallel after execution, the arrays are restored from their backups
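A minimal software sketch of the save/restore mechanism (function and parameter names are my own, assuming a dense-access loop where the whole array is backed up):

```python
import copy

def run_speculatively(A, loop_body, passed_test):
    """Back up A, run the loop speculatively, and restore from the
    backup if the run-time test reported a dependence (a sketch,
    not the paper's implementation)."""
    backup = copy.deepcopy(A)      # dense access: save the whole array
    loop_body(A)                   # speculative (parallel) execution
    if not passed_test:
        A[:] = backup              # loop was not parallel: roll back
        loop_body(A)               # re-execute serially
    return A
```

Either way the caller ends with a consistent array; the cost of failure is one wasted parallel pass plus the serial re-execution.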
Speculative Run-TimeParallelization in Software
LRPD test to detect dependences:
flags the existence of cross-iteration dependences
applied to those arrays whose dependences cannot be analyzed at compile time
two phases: Marking & Analysis
LRPD test
setup:
backup A(1:s)
initialize shadow arrays Ar(1:s), Aw(1:s) to zero
initialize scalar Atw to zero
LRPD test
marking: performed for each iteration during speculative parallel execution of the loop
write to A(i): set Aw(i)
read from A(i): if A(i) has not been written in this iteration, set Ar(i)
at the end of each iteration, count how many different elements of A have been written and add the count to Atw
LRPD test
analysis: performed after the speculative execution
compute Atm = number of non-zero Aw(i) over all elements i of the shadow array
if any(Aw(:) ^ Ar(:)), the loop is not a doall; abort execution
else if Atw == Atm, then the loop is a doall
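The marking and analysis phases can be sketched in software as follows (a simplified reconstruction; in the real test, marking runs inline with the speculative loop rather than as a separate pass):

```python
def lrpd_test(iterations, s):
    """Simplified LRPD test. `iterations` is a list of iterations,
    each a list of ('r'|'w', i) accesses to the array under test,
    with element indices in 0..s-1."""
    Ar = [0] * s        # shadow: element read before being written
    Aw = [0] * s        # shadow: element written
    Atw = 0             # per-iteration write counts, summed

    # marking phase (performed during each speculative iteration)
    for accesses in iterations:
        written_here = set()
        for op, i in accesses:
            if op == 'w':
                Aw[i] = 1
                written_here.add(i)
            elif i not in written_here:
                Ar[i] = 1           # read of an element not yet
                                    # written in this iteration
        Atw += len(written_here)

    # analysis phase (performed after the loop completes)
    Atm = sum(Aw)                   # distinct elements written
    if any(w and r for w, r in zip(Aw, Ar)):
        return False                # cross-iteration dependence: abort
    return Atw == Atm               # doall iff no element was written
                                    # in more than one iteration
```

Running it on the access patterns of the worked examples that follow reproduces their pass/fail outcomes.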
Example
two parallel threads each write to element x of the array
Aw = 1, Ar = 0, so any(Aw ^ Ar) = 0
Atw = 2, Atm = 1
Since Atw ≠ Atm, parallelization fails
Example
one parallel thread writes to element x of the array while another reads it
Aw = 1, Ar = 1, so any(Aw ^ Ar) = 1
Atw = 1, Atm = 1
Since any(Aw ^ Ar) == 1, parallelization fails
Example
one thread writes to element x of the array and then reads it in the same iteration
Aw = 1, Ar = 0*, so any(Aw ^ Ar) = 0
Atw = 1, Atm = 1
Since Atw == Atm, the loop is a doall
* Ar(i) is not set because A(i) was written earlier in the same iteration
Speculative Run-TimeParallelization in Software
implementation:
in a DSM system, each processor allocates a private copy of the shadow arrays
the marking phase is performed locally
for the analysis phase, the private shadow arrays are merged in parallel
compiler integration:
part of a front-end parallelizing compiler
loops are chosen for parallelization based on user feedback or heuristics about previous success rate
Speculative Run-TimeParallelization in Software
improvements: privatization; iteration-wise vs. processor-wise versions
shortcomings:
overhead of the analysis phase and of the extra instructions for marking
parallelization failure is detected only after the loop completes execution
privatization example

for i = 1 to N
   tmp = f(i)      /* f is some operation */
   A(i) = A(i) + tmp
enddo

in privatization, for each processor, we create private copies of the variables causing anti or output dependences (here, the scalar tmp)
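A runnable sketch of the privatized loop (f is a placeholder operation of my own choosing; the point is that tmp becomes private to each task, removing the cross-iteration anti/output dependence on it):

```python
from concurrent.futures import ThreadPoolExecutor

def f(i):
    return i * i                    # stand-in for "some operation"

def privatized_loop(A):
    """Parallel version of the slide's loop with tmp privatized."""
    def body(i):
        tmp = f(i)                  # tmp is local to this task:
                                    # a private copy per iteration
        A[i] = A[i] + tmp           # each iteration writes its own A[i]
    with ThreadPoolExecutor() as pool:
        list(pool.map(body, range(len(A))))
    return A
```

Because every iteration now touches only its own tmp and its own A(i), the iterations can run in any order.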
Speculative Run-TimeParallelization in Hardware
extend the cache coherence protocol hardware of a DSM multiprocessor with extra transactions that flag any cross-iteration data dependences
on detection, parallel execution is immediately aborted
requires extra state in the tags of all caches and fast memory in the directories
Speculative Run-TimeParallelization in Hardware
two sets of transactions: a non-privatization algorithm and a privatization algorithm
non-privatization algorithm
identify as parallel those loops where each element of the array under test is either read-only or is accessed by only one processor
a pattern where an element is read by several processors and later written by one is flagged as not parallel
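In software terms, the per-element classification the hardware performs could be reconstructed roughly as follows (my own sketch of the access rule, not the paper's bit-level protocol):

```python
def nonpriv_parallel(accesses):
    """`accesses` is a list of (processor, 'r'|'w', element) events.
    The loop qualifies as parallel if every element of the array
    under test is either read-only or touched by only one processor."""
    readers, writers = {}, {}
    for proc, op, elem in accesses:
        table = readers if op == 'r' else writers
        table.setdefault(elem, set()).add(proc)
    for elem, wprocs in writers.items():
        procs = readers.get(elem, set()) | wprocs
        if len(procs) > 1:
            return False     # written and shared: flag as not parallel
    return True
```

The hardware reaches the same verdict incrementally, updating per-element bits on each access instead of scanning a log afterwards.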
non-privatization algorithm
the fast memory has three entries: ROnly, NoShr, First
these entries are also sent to the cache and stored in the tags of the corresponding cache line
the per-element bits in the tags of the different caches and directories are kept coherent
Speculative Run-TimeParallelization in Hardware
the implementation needs three supports: storage for the access bits, logic to test and change the bits, and a table in the directory to find the access bits for a given physical address
three parts are modified: the primary cache, the secondary cache, and the directory
implementation
primary cache:
access bits are stored in an SRAM table called the Access Bit Array
the algorithm operations are determined by the Control input
the Test Logic performs the operations
implementation
secondary cache:
needs its own Access Bit Array
on an L1 miss that hits in L2, L2 provides the data and access bits to L1
the access bits are sent directly to the test logic in L1
the bits generated are stored in the access bit array of L1
implementation
directory:
small dedicated memory for the access bits, with a lookup table
the access bits generated by the logic are sent to the processor
the transaction is overlapped with the memory and directory access
Evaluation
execution-driven simulations of a CC-NUMA shared-memory multiprocessor using Tango-lite
loops from applications in the Perfect Club benchmark set and one application from NCSA: Ocean, P3m, Adm, Track
four environments are compared: Serial, Ideal, SW, HW
loops are run with 16 processes (except Ocean, which runs with 8)
Evaluation
loop execution speedup
Evaluation
slowdown due to failure
Evaluation
scalability
Software vs. Hardware
in hardware, failure to parallelize is detected on the fly
several operations are performed in hardware, which reduces overheads
hardware scheme has better scalability with number of processors
hardware scheme has less space overhead
Software vs. Hardware
in hardware, the non-privatization test is processor-wise and does not require static scheduling
hardware scheme can be applied to pointer-based C code more efficiently
however, software implementation does not require any hardware!