Thread-Level Speculation
Karan Singh, CS 612
2.23.2006
2.23.2006 CS 612 2
Introduction
extraction of parallelism at compile time is limited
Thread-Level Speculation (TLS) is a form of optimistic parallelization
TLS allows automatic parallelization by supporting thread execution without advance knowledge of any dependence violations
Introduction
Zhang et al. – extensions to the cache coherence protocol hardware to detect dependence violations
Pickett et al. – design for a Java-specific software TLS system that operates at the bytecode level
Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors
Ye Zhang, Lawrence Rauchwerger, Josep Torrellas
Outline
Loop parallelization basics
Speculative Run-Time Parallelization in Software
Speculative Run-Time Parallelization in Hardware
Evaluation and Comparison
Loop parallelization basics
a loop can be executed in parallel without synchronization only if the outcome is independent of the order of iterations
need to analyze data dependences across iterations: flow, anti, output
if no dependences – doall loop
if only anti or output dependences – privatization, scalar expansion …
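As a concrete illustration (my own hypothetical loops, not from the paper), the three cross-iteration dependence types look like this:

```python
N = 8

# flow (true) dependence: iteration i reads what iteration i-1 wrote;
# this ordering constraint cannot be removed, so the loop is not a doall
a = list(range(N + 1))
for i in range(1, N + 1):
    a[i] = a[i - 1] + 1

# anti dependence: iteration i reads b[i+1] before iteration i+1
# overwrites it; removable by keeping a private copy of the old values
b = list(range(N + 2))
for i in range(1, N + 1):
    b[i] = b[i + 1] + 1

# output dependence: every iteration writes the same scalar;
# removable by scalar expansion (one t per iteration)
t = 0
for i in range(1, N + 1):
    t = i
```

Speculation targets loops where the compiler cannot prove which, if any, of these patterns actually occurs at run time.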
Loop parallelization basics
to parallelize such a loop speculatively, we need:
a way of saving and restoring state
a method to detect cross-iteration dependences
Speculative Run-TimeParallelization in Software
mechanism for saving/restoring state:
before executing speculatively, we need to save the state of the arrays that will be modified
dense access – save the whole array
sparse access – save individual elements
in all cases, save only the modified shared arrays
if the loop is not found parallel after execution, the arrays are restored from their backups
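A minimal software sketch of the save/restore mechanism (function and parameter names are my own, assuming a dense-access loop where the whole array is backed up):

```python
import copy

def run_speculatively(A, loop_body, passed_test):
    """Back up A, run the loop speculatively, and restore from the
    backup if the run-time test reported a dependence (a sketch,
    not the paper's implementation)."""
    backup = copy.deepcopy(A)      # dense access: save the whole array
    loop_body(A)                   # speculative (parallel) execution
    if not passed_test:
        A[:] = backup              # loop was not parallel: roll back
        loop_body(A)               # re-execute serially
    return A
```

Either way the caller ends with a consistent array; the cost of failure is one wasted parallel pass plus the serial re-execution.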
Speculative Run-TimeParallelization in Software
LRPD test to detect dependences:
flags the existence of cross-iteration dependences
applied to those arrays whose dependences cannot be analyzed at compile time
two phases: Marking & Analysis
LRPD test
setup:
backup A(1:s)
initialize shadow arrays Ar(1:s), Aw(1:s) to zero
initialize scalar Atw to zero
LRPD test
marking: performed for each iteration during speculative parallel execution of the loop
write to A(i): set Aw(i)
read from A(i): if A(i) has not been written in this iteration, set Ar(i)
at the end of each iteration, count how many different elements of A have been written and add the count to Atw
LRPD test
analysis: performed after the speculative execution
compute Atm = number of non-zero Aw(i) over all elements i of the shadow array
if any(Aw(:) ^ Ar(:)), the loop is not a doall; abort execution
else if Atw == Atm, then the loop is a doall
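The marking and analysis phases can be sketched in software as follows (a simplified reconstruction; in the real test, marking runs inline with the speculative loop rather than as a separate pass):

```python
def lrpd_test(iterations, s):
    """Simplified LRPD test. `iterations` is a list of iterations,
    each a list of ('r'|'w', i) accesses to the array under test,
    with element indices in 0..s-1."""
    Ar = [0] * s        # shadow: element read before being written
    Aw = [0] * s        # shadow: element written
    Atw = 0             # per-iteration write counts, summed

    # marking phase (performed during each speculative iteration)
    for accesses in iterations:
        written_here = set()
        for op, i in accesses:
            if op == 'w':
                Aw[i] = 1
                written_here.add(i)
            elif i not in written_here:
                Ar[i] = 1           # read of an element not yet
                                    # written in this iteration
        Atw += len(written_here)

    # analysis phase (performed after the loop completes)
    Atm = sum(Aw)                   # distinct elements written
    if any(w and r for w, r in zip(Aw, Ar)):
        return False                # cross-iteration dependence: abort
    return Atw == Atm               # doall iff no element was written
                                    # in more than one iteration
```

Running it on the access patterns of the worked examples that follow reproduces their pass/fail outcomes.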
Example
two parallel threads each write to element x of the array
Aw = 1, Ar = 0, so any(Aw ^ Ar) = 0
Atw = 2, Atm = 1
Since Atw ≠ Atm, parallelization fails
Example
one parallel thread writes to element x of the array while another reads it
Aw = 1, Ar = 1, so any(Aw ^ Ar) = 1
Atw = 1, Atm = 1
Since any(Aw ^ Ar) == 1, parallelization fails
Example
one thread writes to element x of the array and then reads it in the same iteration
Aw = 1, Ar = 0*, so any(Aw ^ Ar) = 0
Atw = 1, Atm = 1
Since Atw == Atm, the loop is a doall
* Ar(i) is not set because A(i) was written earlier in the same iteration
Speculative Run-TimeParallelization in Software
implementation:
in a DSM system, each processor allocates a private copy of the shadow arrays
the marking phase is performed locally
for the analysis phase, the private shadow arrays are merged in parallel
compiler integration:
part of a front-end parallelizing compiler
loops are chosen for parallelization based on user feedback or heuristics about previous success rate
Speculative Run-TimeParallelization in Software
improvements: privatization; iteration-wise vs. processor-wise versions
shortcomings:
overhead of the analysis phase and of the extra instructions for marking
parallelization failure is detected only after the loop completes execution
privatization example

for i = 1 to N
   tmp = f(i)      /* f is some operation */
   A(i) = A(i) + tmp
enddo

in privatization, for each processor, we create private copies of the variables causing anti or output dependences (here, the scalar tmp)
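A runnable sketch of the privatized loop (f is a placeholder operation of my own choosing; the point is that tmp becomes private to each task, removing the cross-iteration anti/output dependence on it):

```python
from concurrent.futures import ThreadPoolExecutor

def f(i):
    return i * i                    # stand-in for "some operation"

def privatized_loop(A):
    """Parallel version of the slide's loop with tmp privatized."""
    def body(i):
        tmp = f(i)                  # tmp is local to this task:
                                    # a private copy per iteration
        A[i] = A[i] + tmp           # each iteration writes its own A[i]
    with ThreadPoolExecutor() as pool:
        list(pool.map(body, range(len(A))))
    return A
```

Because every iteration now touches only its own tmp and its own A(i), the iterations can run in any order.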
Speculative Run-TimeParallelization in Hardware
extend the cache coherence protocol hardware of a DSM multiprocessor with extra transactions that flag any cross-iteration data dependences
on detection, parallel execution is immediately aborted
requires extra state in the tags of all caches and fast memory in the directories
Speculative Run-TimeParallelization in Hardware
two sets of transactions: a non-privatization algorithm and a privatization algorithm
non-privatization algorithm
identify as parallel those loops where each element of the array under test is either read-only or is accessed by only one processor
a pattern where an element is read by several processors and later written by one is flagged as not parallel
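In software terms, the per-element classification the hardware performs could be reconstructed roughly as follows (my own sketch of the access rule, not the paper's bit-level protocol):

```python
def nonpriv_parallel(accesses):
    """`accesses` is a list of (processor, 'r'|'w', element) events.
    The loop qualifies as parallel if every element of the array
    under test is either read-only or touched by only one processor."""
    readers, writers = {}, {}
    for proc, op, elem in accesses:
        table = readers if op == 'r' else writers
        table.setdefault(elem, set()).add(proc)
    for elem, wprocs in writers.items():
        procs = readers.get(elem, set()) | wprocs
        if len(procs) > 1:
            return False     # written and shared: flag as not parallel
    return True
```

The hardware reaches the same verdict incrementally, updating per-element bits on each access instead of scanning a log afterwards.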
non-privatization algorithm
the fast memory has three entries: ROnly, NoShr, First
these entries are also sent to the cache and stored in the tags of the corresponding cache line
the per-element bits in the tags of the different caches and directories are kept coherent
Speculative Run-TimeParallelization in Hardware
the implementation needs three supports: storage for the access bits, logic to test and change the bits, and a table in the directory to find the access bits for a given physical address
three parts are modified: the primary cache, the secondary cache, and the directory
implementation
primary cache:
access bits are stored in an SRAM table called the Access Bit Array
the algorithm operations are determined by the Control input
the Test Logic performs the operations
implementation
secondary cache:
needs its own Access Bit Array
on an L1 miss that hits in L2, L2 provides the data and access bits to L1
the access bits are sent directly to the test logic in L1
the bits generated are stored in the access bit array of L1
implementation
directory:
small dedicated memory for the access bits, with a lookup table
the access bits generated by the logic are sent to the processor
the transaction is overlapped with the memory and directory access
Evaluation
execution-driven simulations of a CC-NUMA shared-memory multiprocessor using Tango-lite
loops from applications in the Perfect Club benchmark set and one application from NCSA: Ocean, P3m, Adm, Track
four environments are compared: Serial, Ideal, SW, HW
loops are run with 16 processes (except Ocean, which runs with 8)
Evaluation
loop execution speedup
Evaluation
slowdown due to failure
Evaluation
scalability
Software vs. Hardware
in hardware, failure to parallelize is detected on the fly
several operations are performed in hardware, which reduces overheads
hardware scheme has better scalability with number of processors
hardware scheme has less space overhead
Software vs. Hardware
in hardware, the non-privatization test is processor-wise and does not require static scheduling
hardware scheme can be applied to pointer-based C code more efficiently
however, software implementation does not require any hardware!