
Thread-Level Speculation

Karan Singh, CS 612

2.23.2006


Introduction

extraction of parallelism at compile time is limited

Thread-Level Speculation (TLS) is a form of optimistic parallelization

TLS allows automatic parallelization by supporting thread execution without advance knowledge of any dependence violations


Introduction

Zhang et al.: extensions to cache coherence protocol hardware to detect dependence violations

Pickett et al.: design for a Java-specific software TLS system that operates at the bytecode level


Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors

Ye Zhang, Lawrence Rauchwerger, Josep Torrellas


Outline

Loop parallelization basics

Speculative Run-Time Parallelization in Software

Speculative Run-Time Parallelization in Hardware

Evaluation and Comparison


Loop parallelization basics

a loop can be executed in parallel without synchronization only if the outcome is independent of the order of iterations

need to analyze data dependences across iterations: flow, anti, output (see the sketch below)

if no dependences: doall loop

if only anti or output dependences: privatization, scalar expansion, …
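As a rough illustration of these dependence classes, here is a minimal Python sketch; the arrays and loop bodies are invented for illustration and are not from the slides:

    # Hedged sketch of the dependence classes named above.
    N = 8
    A = list(range(N))
    B = [0] * N

    # flow (true) dependence: iteration i reads what iteration i-1 wrote,
    # so iterations cannot be reordered
    for i in range(1, N):
        A[i] = A[i - 1] + 1

    # anti dependence: iteration i reads A[i+1] before iteration i+1 overwrites it;
    # removable by privatization/copying rather than synchronization
    for i in range(N - 1):
        A[i] = A[i + 1] + 1

    # output dependence: every iteration writes the same scalar `last`;
    # scalar expansion (one copy per iteration) removes it
    for i in range(N):
        last = A[i]

    # doall loop: no cross-iteration dependences, so iterations may run
    # in any order (or in parallel) without synchronization
    for i in range(N):
        B[i] = 2 * A[i]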


Loop parallelization basics

to parallelize such a loop speculatively, we need:

a way of saving and restoring state

a method to detect cross-iteration dependences


Speculative Run-Time Parallelization in Software

mechanism for saving/restoring state

before executing speculatively, we need to save the state of the arrays that will be modified

dense access: save the whole array

sparse access: save individual elements

in all cases, save only the modified shared arrays

if the loop is not found parallel after execution, arrays are restored from their backups (see the sketch below)
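A minimal sketch of the save/restore idea; the helper names (backup_dense, backup_sparse, and so on) are invented and this is not the paper's implementation:

    import copy

    def backup_dense(array):
        # dense access pattern: checkpoint the whole array
        return copy.copy(array)

    def backup_sparse(array, touched_indices):
        # sparse access pattern: checkpoint only the elements that will be modified
        return {i: array[i] for i in touched_indices}

    def restore_dense(array, checkpoint):
        array[:] = checkpoint

    def restore_sparse(array, checkpoint):
        for i, value in checkpoint.items():
            array[i] = value

    # usage: checkpoint, run the loop speculatively, restore on failure
    A = [1, 2, 3, 4]
    saved = backup_dense(A)
    speculation_failed = True      # e.g. the dependence test flagged a violation
    if speculation_failed:
        restore_dense(A, saved)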


Speculative Run-Time Parallelization in Software

LRPD test to detect dependences

flags the existence of cross-iteration dependences

applied to those arrays whose dependences cannot be analyzed at compile time

two phases: Marking and Analysis


LRPD test

setup:

backup A(1:s)

initialize shadow arrays Ar(1:s), Aw(1:s) to zero

initialize scalar Atw to zero


LRPD test

marking: performed for each iteration during speculative parallel execution of the loop

write to A(i): set Aw(i)

read from A(i): if A(i) not written in this iteration, set Ar(i)

at end of iteration, count how many different elements of A have been written and add the count to Atw (see the marking sketch below)
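A minimal software sketch of the marking rules for one iteration; the function name and the access-list format are invented for illustration, this is not the paper's code:

    def run_iteration_marked(accesses, A, Aw, Ar):
        """accesses: list of ('r'|'w', index) pairs issued by this iteration.
        Returns the number of distinct elements written by this iteration,
        which the caller adds to Atw."""
        written_this_iter = set()
        for op, i in accesses:
            if op == 'w':
                Aw[i] = 1                       # write to A(i): set Aw(i)
                written_this_iter.add(i)
            else:                               # read from A(i)
                if i not in written_this_iter:  # not written earlier in this iteration
                    Ar[i] = 1                   # set Ar(i)
                _ = A[i]
        return len(written_this_iter)           # contribution to Atw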


LRPD test

analysis: performed after the speculative execution

compute Atm = number of non-zero Aw(i) over all elements i of the shadow array

if any(Aw(:) ^ Ar(:)), the loop is not a doall; abort execution

else if Atw == Atm, then the loop is a doall (see the analysis sketch below)
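A matching sketch of the analysis phase under the same conventions; lrpd_analysis is an invented name, and Aw, Ar, Atw are the values produced by the marking sketch above:

    def lrpd_analysis(Aw, Ar, Atw):
        # an element both written (in some iteration) and read (in another
        # iteration that did not write it first) means a cross-iteration dependence
        if any(w and r for w, r in zip(Aw, Ar)):
            return False                   # not a doall: abort and re-execute serially
        Atm = sum(1 for w in Aw if w)      # number of distinct elements written
        return Atw == Atm                  # equal: each element written by at most one iteration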


Example

two parallel threads each write to element x of the array: w(x), w(x)

Aw(x) = 1, Ar(x) = 0

Aw ^ Ar = 0, so any(Aw ^ Ar) = 0

Atw = 2, Atm = 1

Since Atw ≠ Atm, parallelization fails


Example

two parallel threads: one writes element x of the array while the other reads it: w(x), r(x)

Aw(x) = 1, Ar(x) = 1

Aw ^ Ar = 1, so any(Aw ^ Ar) = 1

Atw = 1, Atm = 1

Since any(Aw ^ Ar) == 1, parallelization fails


Example

element x of the array is written in one iteration and read in a different, later iteration on parallel threads: w(x), then r(x)

Aw(x) = 1, Ar(x) = 1

Aw ^ Ar = 1, so any(Aw ^ Ar) = 1

Atw = 1, Atm = 1

Since any(Aw ^ Ar) == 1, parallelization fails


Example

element x of the array is written and then read within the same iteration: w(x), then r(x)

Aw(x) = 1, Ar(x) = 0*

Aw ^ Ar = 0, so any(Aw ^ Ar) = 0

Atw = 1, Atm = 1

Since Atw == Atm, the loop is a doall

* Ar(i) is set only if A(i) was not written in this iteration; here the read follows a write in the same iteration, so Ar(x) stays 0
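A self-contained toy run of the LRPD logic over access patterns like those in the examples; lrpd_toy is an invented name, and the array size s and index x are arbitrary illustration values:

    s, x = 4, 2

    def lrpd_toy(iterations):
        """iterations: list of per-iteration access lists of ('r'|'w', index)."""
        Aw, Ar, Atw = [0] * s, [0] * s, 0
        for accesses in iterations:
            written = set()
            for op, i in accesses:
                if op == 'w':
                    Aw[i] = 1
                    written.add(i)
                elif i not in written:         # read without an earlier write this iteration
                    Ar[i] = 1
            Atw += len(written)
        if any(w and r for w, r in zip(Aw, Ar)):
            return False                        # cross-iteration dependence detected
        return Atw == sum(Aw)                   # Atw == Atm

    print(lrpd_toy([[('w', x)], [('w', x)]]))   # two iterations write x -> False
    print(lrpd_toy([[('w', x)], [('r', x)]]))   # write and read in different iterations -> False
    print(lrpd_toy([[('w', x), ('r', x)]]))     # write then read in the same iteration -> True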


Example


Speculative Run-Time Parallelization in Software

implementation in a DSM system:

each processor allocates a private copy of the shadow arrays

marking phase performed locally

for the analysis phase, private shadow arrays are merged in parallel (see the merge sketch below)

compiler integration:

part of a front-end parallelizing compiler

loops to parallelize are chosen based on user feedback or heuristics about the previous success rate
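A hedged sketch of the private-shadow-array idea with a simple merge; thread scheduling and the DSM details are simplified away and all names, sizes, and the iteration distribution are illustrative:

    from concurrent.futures import ThreadPoolExecutor

    s = 8
    num_procs = 4

    def mark_locally(my_iterations):
        # each "processor" marks into private shadow arrays
        Aw, Ar, Atw = [0] * s, [0] * s, 0
        for accesses in my_iterations:
            written = set()
            for op, i in accesses:
                if op == 'w':
                    Aw[i] = 1
                    written.add(i)
                elif i not in written:
                    Ar[i] = 1
            Atw += len(written)
        return Aw, Ar, Atw

    # arbitrary distribution of iterations over processors (illustration only)
    iterations = [[('w', i)] for i in range(s)]
    chunks = [iterations[p::num_procs] for p in range(num_procs)]

    with ThreadPoolExecutor(num_procs) as pool:
        results = list(pool.map(mark_locally, chunks))

    # merge private shadow arrays (element-wise OR, summed write counts);
    # the merged values then feed the analysis phase
    Aw = [max(r[0][i] for r in results) for i in range(s)]
    Ar = [max(r[1][i] for r in results) for i in range(s)]
    Atw = sum(r[2] for r in results)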


Speculative Run-Time Parallelization in Software

improvements:

privatization

iteration-wise vs. processor-wise tests

shortcomings:

overhead of the analysis phase and extra instructions for marking

we only learn that parallelization failed after the loop completes execution


privatization example

    for i = 1 to N
        tmp = f(i)          /* f is some operation */
        A(i) = A(i) + tmp
    enddo

in privatization, for each processor we create private copies of the variables causing anti or output dependences (here, the scalar tmp); a sketch follows below
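A hedged Python rendering of the same transformation with tmp private to each worker; the worker count, the body of f, and the array contents are placeholder choices, not from the slides:

    from concurrent.futures import ThreadPoolExecutor

    def f(i):
        return i * i            # "f is some operation": placeholder body

    N = 16
    A = [1] * N

    def run_chunk(indices):
        for i in indices:
            tmp = f(i)          # tmp is private to this worker (indeed, to this iteration)
            A[i] = A[i] + tmp   # each iteration touches a distinct A[i], so no conflict

    with ThreadPoolExecutor(4) as pool:
        chunks = [range(p, N, 4) for p in range(4)]
        list(pool.map(run_chunk, chunks))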


Speculative Run-Time Parallelization in Hardware

extend the cache coherence protocol hardware of a DSM multiprocessor with extra transactions to flag any cross-iteration data dependences

on detection, parallel execution is immediately aborted

extra state in the tags of all caches

fast memory in the directories


Speculative Run-Time Parallelization in Hardware

two sets of transactions:

non-privatization algorithm

privatization algorithm


non-privatization algorithm

identify as parallel those loops where each element of the array under test is either read-only or accessed by only one processor (see the sketch below)

a pattern where an element is read by several processors and later written by one is flagged as not parallel
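A hedged software model of this condition only; the real scheme evaluates it on the fly with per-element access bits in the cache tags and directory, whereas this sketch simply checks a recorded access trace:

    def loop_is_parallel(accesses):
        """accesses: list of (processor_id, 'r'|'w', element_index) records."""
        writers = {}    # element -> set of processors that wrote it
        toucher = {}    # element -> set of processors that accessed it
        for proc, op, elem in accesses:
            toucher.setdefault(elem, set()).add(proc)
            if op == 'w':
                writers.setdefault(elem, set()).add(proc)
        for elem, procs in toucher.items():
            read_only = elem not in writers
            single_processor = len(procs) == 1
            if not (read_only or single_processor):
                return False    # e.g. read by several processors and written by one
        return True

    print(loop_is_parallel([(0, 'r', 5), (1, 'r', 5), (2, 'w', 5)]))   # False
    print(loop_is_parallel([(0, 'w', 3), (0, 'r', 3), (1, 'r', 5)]))   # True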


non-privatization algorithm

fast memory has three entries: ROnly, NoShr, First

these entries are also sent to cache and stored in tags of the corresponding cache line

per-element bits in tags of different caches and directories are kept coherent


non-privatization algorithm


Speculative Run-Time Parallelization in Hardware

implementation needs three supports:

storage for the access bits

logic to test and change the bits

a table in the directory to find the access bits for a given physical address

three parts are modified: primary cache, secondary cache, directory


implementation

primary cache:

access bits stored in an SRAM table called the Access Bit Array

algorithm operations determined by the Control input

Test Logic performs the operations


implementation

secondary cache:

also needs an Access Bit Array

on an L1 miss that hits in L2, L2 provides data and access bits to L1

access bits sent directly to the test logic in L1

bits generated are stored in the access bit array of L1


implementation

directory:

small dedicated memory for access bits, with a lookup table

access bits generated by the logic are sent to the processor

the transaction is overlapped with the memory and directory access


Evaluation

execution-driven simulations of a CC-NUMA shared-memory multiprocessor using Tango-lite

loops from applications in the Perfect Club set and one application from NCSA: Ocean, P3m, Adm, Track

compare four environments: Serial, Ideal, SW, HW

loops run with 16 processes (except Ocean, which runs with 8 processes)


Evaluation

loop execution speedup


Evaluation

slowdown due to failure


Evaluation

scalability


Software vs. Hardware

in hardware, failure to parallelize is detected on the fly

several operations are performed in hardware, which reduces overheads

hardware scheme has better scalability with number of processors

hardware scheme has less space overhead


Software vs. Hardware

in hardware, the non-privatization test is processor-wise without requiring static scheduling

hardware scheme can be applied to pointer-based C code more efficiently

however, software implementation does not require any hardware!