CS3383 Unit 4: dynamic multithreaded algorithms€¦ · IntroductiontoParallelAlgorithms DynamicMultithreading Alsoknownasthefork-joinmodel Sharedmemory,multicore Cormenet.al3rdedition,Chapter27

CS3383 Unit 4: dynamic multithreadedalgorithms

David Bremner

March 25, 2018

Outline

Dynamic Multithreaded AlgorithmsFork-Join ModelSpan, Work, And ParallelismParallel LoopsSchedulingRace Conditions

Contents


Introduction to Parallel Algorithms

Dynamic Multithreading

▶ Also known as the fork-join model▶ Shared memory, multicore▶ Cormen et. al 3rd edition, Chapter 27

Introduction to Parallel AlgorithmsDynamic Multithreading

▶ Also known as the fork-join model▶ Shared memory, multicore▶ Cormen et. al 3rd edition, Chapter 27

Nested Parallelism▶ Spawn a subroutine, carry on with other work.▶ Similar to fork in POSIX.

Introduction to Parallel Algorithms

Nested Parallelism▶ Spawn a subroutine, carry on with other work.▶ Similar to fork in POSIX.

Parallel Loop

▶ iterations of a for loop can execute in parallel.▶ Like OpenMP

Cilk+

▶ The multithreaded model is based on Cilk+, available in thelatest versions of gcc.

▶ Programmer specifies possible paralellism▶ Runtime system takes care of mapping to OS threads▶ Cilk+ contains several more features than our model, e.g.

parallel vector and array operations.▶ Similar primitives are available in java.util.concurrent

Cilk+


▶ Programmer specifies possible paralellism

▶ Runtime system takes care of mapping to OS threads▶ Cilk+ contains several more features than our model, e.g.


Cilk+


▶ Programmer specifies possible paralellism▶ Runtime system takes care of mapping to OS threads

▶ Cilk+ contains several more features than our model, e.g.parallel vector and array operations.

▶ Similar primitives are available in java.util.concurrent

Cilk+



parallel vector and array operations.

▶ Similar primitives are available in java.util.concurrent

Cilk+




Writing parallel (pseudo)-codeKeywords

parallel Run the loop (potentially) concurrentlyspawn Run the procedure (potentially) concurrently

sync Wait for all spawned children to complete.

Serialization▶ remove keywords from parallel code yields correct serial code▶ Adding parallel keywords to correct serial code might break it

▶ missing sync▶ loop iterations not independent










▶ missing sync

▶ loop iterations not independent






Fibonacci Examplefunction Fib(𝑛)

if 𝑛 ≤ 1 thenreturn 𝑛

else𝑥 = Fib(𝑛 − 1)𝑦 = Fib(𝑛 − 2)

return 𝑥 + 𝑦end if

end function

▶ Code in C, Java, Clojure and Racket available from http://www.cs.unb.ca/~bremner/teaching/cs3383/examples

http://www.cs.unb.ca/~bremner/teaching/cs3383/examples


Fibonacci Examplefunction Fib(𝑛)

if 𝑛 ≤ 1 thenreturn 𝑛

else𝑥 = spawn Fib(𝑛 − 1)𝑦 = Fib(𝑛 − 2)syncreturn 𝑥 + 𝑦

end ifend function

▶ Code in C, Java, Clojure and Racket available from http://www.cs.unb.ca/~bremner/teaching/cs3383/examples



Contents


Computation DAGStrandsSeq. inst. with no parallel, spawn, return from spawn, or sync.

function Fib(𝑛)if 𝑛 ≤ 1 then ▷

return 𝑛else

𝑥 = spawn Fib(𝑛 − 1)𝑦 = Fib(𝑛 − 2) ▷syncreturn 𝑥 + 𝑦 ▷

end ifend function

P-FIB(1) P-FIB(0)

P-FIB(3)

P-FIB(4)

P-FIB(1)

P-FIB(1)

P-FIB(0)

P-FIB(2)

P-FIB(2)


nodes strandsdown edges spawn

up edges returnhorizontal edges sequentialcritical path longest path in DAG

span weighted length ofcritical path ≡ lowerbound on time

P-FIB(1) P-FIB(0)

P-FIB(3)

P-FIB(4)

P-FIB(1)

P-FIB(1)

P-FIB(0)

P-FIB(2)

P-FIB(2)


nodes strandsdown edges spawnup edges return

horizontal edges sequentialcritical path longest path in DAG


P-FIB(1) P-FIB(0)

P-FIB(3)

P-FIB(4)

P-FIB(1)

P-FIB(1)

P-FIB(0)

P-FIB(2)

P-FIB(2)



horizontal edges sequential

critical path longest path in DAGspan weighted length of

critical path ≡ lowerbound on time

P-FIB(1) P-FIB(0)

P-FIB(3)

P-FIB(4)

P-FIB(1)

P-FIB(1)

P-FIB(0)

P-FIB(2)

P-FIB(2)





P-FIB(1) P-FIB(0)

P-FIB(3)

P-FIB(4)

P-FIB(1)

P-FIB(1)

P-FIB(0)

P-FIB(2)

P-FIB(2)





P-FIB(1) P-FIB(0)

P-FIB(3)

P-FIB(4)

P-FIB(1)

P-FIB(1)

P-FIB(0)

P-FIB(2)

P-FIB(2)

Work and Speedup

𝑇1 Work, sequential time.

𝑇𝑝 Time on 𝑝 processors.

Work Law

𝑇𝑝 ≥ 𝑇1/𝑝speedup ∶= 𝑇1/𝑇𝑝 ≤ 𝑝

Work and Speedup

𝑇1 Work, sequential time.𝑇𝑝 Time on 𝑝 processors.

Work Law


Work and Speedup

𝑇1 Work, sequential time.𝑇𝑝 Time on 𝑝 processors.

Work Law


Parallelism

𝑇𝑝 Time on 𝑝 processors.

𝑇∞ Span, time givenunlimited processors.

We could idle processors:

𝑇𝑝 ≥ 𝑇∞ (1)

Best possible speedup:

parallelism = 𝑇1/𝑇∞≥ 𝑇1/𝑇𝑝 = speedup

Parallelism

𝑇𝑝 Time on 𝑝 processors.𝑇∞ Span, time given

unlimited processors.


𝑇𝑝 ≥ 𝑇∞ (1)



Parallelism

𝑇𝑝 Time on 𝑝 processors.𝑇∞ Span, time given

unlimited processors.


𝑇𝑝 ≥ 𝑇∞ (1)



Span and Parallelism Example

Assume strands are unit cost.▶ 𝑇1 = 17

▶ 𝑇∞ = 8▶ Parallelism = 2.125 for this

input size.

P-FIB(1) P-FIB(0)

P-FIB(3)

P-FIB(4)

P-FIB(1)

P-FIB(1)

P-FIB(0)

P-FIB(2)

P-FIB(2)


Assume strands are unit cost.▶ 𝑇1 = 17▶ 𝑇∞ = 8

▶ Parallelism = 2.125 for thisinput size.

P-FIB(1) P-FIB(0)

P-FIB(3)

P-FIB(4)

P-FIB(1)

P-FIB(1)

P-FIB(0)

P-FIB(2)

P-FIB(2)


Assume strands are unit cost.▶ 𝑇1 = 17▶ 𝑇∞ = 8▶ Parallelism = 2.125 for this

input size.P-FIB(1) P-FIB(0)

P-FIB(3)

P-FIB(4)

P-FIB(1)

P-FIB(1)

P-FIB(0)

P-FIB(2)

P-FIB(2)

Composing span and workA

A

B

B

A‖B

A +B

series 𝑇∞(𝐴 + 𝐵) = 𝑇∞(𝐴) + 𝑇∞(𝐵)

parallel 𝑇∞(𝐴‖𝐵) = max(𝑇∞(𝐴), 𝑇∞(𝐵))series or parallel 𝑇1 = 𝑇1(𝐴) + 𝑇1(𝐵)


A

B

B

A‖B

A +B

series 𝑇∞(𝐴 + 𝐵) = 𝑇∞(𝐴) + 𝑇∞(𝐵)parallel 𝑇∞(𝐴‖𝐵) = max(𝑇∞(𝐴), 𝑇∞(𝐵))

series or parallel 𝑇1 = 𝑇1(𝐴) + 𝑇1(𝐵)


A

B

B

A‖B

A +B

series 𝑇∞(𝐴 + 𝐵) = 𝑇∞(𝐴) + 𝑇∞(𝐵)parallel 𝑇∞(𝐴‖𝐵) = max(𝑇∞(𝐴), 𝑇∞(𝐵))

series or parallel 𝑇1 = 𝑇1(𝐴) + 𝑇1(𝐵)

Work of Parallel FibonacciWrite 𝑇 (𝑛) for 𝑇1 on input 𝑛.

𝑇 (𝑛) = 𝑇 (𝑛−1)+𝑇 (𝑛−2)+Θ(1)

Let 𝜙 ≈ 1.62 be the solution to

𝜙2 = 𝜙 + 1

We can show by induction (twice)that

𝑇 (𝑛) ∈ Θ(𝜙𝑛)

𝑇 (𝑛) ≤ 𝑎𝜙𝑛 − 𝑏

(I.H.)

Substitute the I.H.

𝑇 (𝑛) ≤ 𝑎(𝜙𝑛−1 + 𝜙𝑛−2) − 2𝑏 + Θ(1)

= 𝑎𝜙 + 1𝜙2 𝜙𝑛 − 𝑏 + (Θ(1) − 𝑏)

≤ 𝑎𝜙 + 1𝜙2 𝜙𝑛 − 𝑏 for 𝑏 large

= 𝑎𝜙𝑛 − 𝑏


𝑇 (𝑛) = 𝑇 (𝑛−1)+𝑇 (𝑛−2)+Θ(1)


𝜙2 = 𝜙 + 1


𝑇 (𝑛) ∈ Θ(𝜙𝑛)


(I.H.)

Substitute the I.H.

𝑇 (𝑛) ≤ 𝑎(𝜙𝑛−1 + 𝜙𝑛−2) − 2𝑏 + Θ(1)

= 𝑎𝜙 + 1𝜙2 𝜙𝑛 − 𝑏 + (Θ(1) − 𝑏)




𝑇 (𝑛) = 𝑇 (𝑛−1)+𝑇 (𝑛−2)+Θ(1)


𝜙2 = 𝜙 + 1


𝑇 (𝑛) ∈ Θ(𝜙𝑛)


(I.H.)

Substitute the I.H.

𝑇 (𝑛) ≤ 𝑎(𝜙𝑛−1 + 𝜙𝑛−2) − 2𝑏 + Θ(1)

= 𝑎𝜙 + 1𝜙2 𝜙𝑛 − 𝑏 + (Θ(1) − 𝑏)




𝑇 (𝑛) = 𝑇 (𝑛−1)+𝑇 (𝑛−2)+Θ(1)


𝜙2 = 𝜙 + 1


𝑇 (𝑛) ∈ Θ(𝜙𝑛)

𝑇 (𝑛) ≤ 𝑎𝜙𝑛 − 𝑏 (I.H.)

Substitute the I.H.

𝑇 (𝑛) ≤ 𝑎(𝜙𝑛−1 + 𝜙𝑛−2) − 2𝑏 + Θ(1)

= 𝑎𝜙 + 1𝜙2 𝜙𝑛 − 𝑏 + (Θ(1) − 𝑏)




𝑇 (𝑛) = 𝑇 (𝑛−1)+𝑇 (𝑛−2)+Θ(1)


𝜙2 = 𝜙 + 1


𝑇 (𝑛) ∈ Θ(𝜙𝑛)

𝑇 (𝑛) ≤ 𝑎𝜙𝑛 − 𝑏 (I.H.)

Substitute the I.H.

𝑇 (𝑛) ≤ 𝑎(𝜙𝑛−1 + 𝜙𝑛−2) − 2𝑏 + Θ(1)

= 𝑎𝜙 + 1𝜙2 𝜙𝑛 − 𝑏 + (Θ(1) − 𝑏)




𝑇 (𝑛) = 𝑇 (𝑛−1)+𝑇 (𝑛−2)+Θ(1)


𝜙2 = 𝜙 + 1


𝑇 (𝑛) ∈ Θ(𝜙𝑛)

𝑇 (𝑛) ≤ 𝑎𝜙𝑛 − 𝑏 (I.H.)

Substitute the I.H.

𝑇 (𝑛) ≤ 𝑎(𝜙𝑛−1 + 𝜙𝑛−2) − 2𝑏 + Θ(1)

= 𝑎𝜙 + 1𝜙2 𝜙𝑛 − 𝑏 + (Θ(1) − 𝑏)




𝑇 (𝑛) = 𝑇 (𝑛−1)+𝑇 (𝑛−2)+Θ(1)


𝜙2 = 𝜙 + 1


𝑇 (𝑛) ∈ Θ(𝜙𝑛)

𝑇 (𝑛) ≤ 𝑎𝜙𝑛 − 𝑏 (I.H.)

Substitute the I.H.

𝑇 (𝑛) ≤ 𝑎(𝜙𝑛−1 + 𝜙𝑛−2) − 2𝑏 + Θ(1)

= 𝑎𝜙 + 1𝜙2 𝜙𝑛 − 𝑏 + (Θ(1) − 𝑏)




𝑇 (𝑛) = 𝑇 (𝑛−1)+𝑇 (𝑛−2)+Θ(1)


𝜙2 = 𝜙 + 1


𝑇 (𝑛) ∈ Θ(𝜙𝑛)

𝑇 (𝑛) ≤ 𝑎𝜙𝑛 − 𝑏 (I.H.)

Substitute the I.H.

𝑇 (𝑛) ≤ 𝑎(𝜙𝑛−1 + 𝜙𝑛−2) − 2𝑏 + Θ(1)

= 𝑎𝜙 + 1𝜙2 𝜙𝑛 − 𝑏 + (Θ(1) − 𝑏)



Span and Parallelism of Fibonacci

𝑇∞(𝑛) = max(𝑇∞(𝑛 − 1), 𝑇∞(𝑛 − 2)) + Θ(1)= 𝑇∞(𝑛 − 1) + Θ(1)

Transforming to sum, we get

𝑇∞ ∈ Θ(𝑛)

parallelism = 𝑇1(𝑛)𝑇∞(𝑛)

= Θ (𝜙𝑛

𝑛)

▶ So an inefficient way to compute Fibonacci, but very parallel


𝑇∞(𝑛) = max(𝑇∞(𝑛 − 1), 𝑇∞(𝑛 − 2)) + Θ(1)= 𝑇∞(𝑛 − 1) + Θ(1)


𝑇∞ ∈ Θ(𝑛)


= Θ (𝜙𝑛

𝑛)



𝑇∞(𝑛) = max(𝑇∞(𝑛 − 1), 𝑇∞(𝑛 − 2)) + Θ(1)= 𝑇∞(𝑛 − 1) + Θ(1)


𝑇∞ ∈ Θ(𝑛)


= Θ (𝜙𝑛

𝑛)



𝑇∞(𝑛) = max(𝑇∞(𝑛 − 1), 𝑇∞(𝑛 − 2)) + Θ(1)= 𝑇∞(𝑛 − 1) + Θ(1)


𝑇∞ ∈ Θ(𝑛)


= Θ (𝜙𝑛

𝑛)


Contents


Parallel Loopsparallel for 𝑖 = 1 to 𝑛 do

statement...statement...

end for

▶ Run 𝑛 copies in parallel with local setting of 𝑖.

▶ Effectively 𝑛-way spawn▶ Can be implemented with spawn and sync▶ Span

𝑇∞(𝑛) = Θ(log 𝑛) + max𝑖

𝑇∞(iteration i)



end for

▶ Run 𝑛 copies in parallel with local setting of 𝑖.▶ Effectively 𝑛-way spawn

▶ Can be implemented with spawn and sync▶ Span





end for

▶ Run 𝑛 copies in parallel with local setting of 𝑖.▶ Effectively 𝑛-way spawn▶ Can be implemented with spawn and sync

▶ Span𝑇∞(𝑛) = Θ(log 𝑛) + max

𝑖𝑇∞(iteration i)



end for

▶ Run 𝑛 copies in parallel with local setting of 𝑖.▶ Effectively 𝑛-way spawn▶ Can be implemented with spawn and sync▶ Span



Parallel Matrix-Vector productTo compute 𝑦 = 𝐴𝑥, in parallel

𝑦𝑖 =𝑛

∑𝑗=1

𝑎𝑖𝑗𝑥𝑗

function RowMult(A,x,y,i)𝑦𝑖 = 0for 𝑗 = 1 to 𝑛 do

𝑦𝑖 = 𝑦𝑖 + 𝑎𝑖𝑗𝑥𝑗end for

end function

function Mat-Vec(𝐴, 𝑥, 𝑦)Let 𝑛 = rows(𝐴)parallel for 𝑖 = 1 to 𝑛 do

RowMult(A,x,y,i)end for

end function

𝑇1(𝑛) ∈ Θ(𝑛2) (serialization)𝑇∞(𝑛) = Θ(log(𝑛))⏟⏟⏟⏟⏟

parallel for

+ Θ(𝑛)⏟RowMult


𝑦𝑖 =𝑛

∑𝑗=1




end function



end function


parallel for



𝑦𝑖 =𝑛

∑𝑗=1




end function



end function


parallel for



𝑦𝑖 =𝑛

∑𝑗=1




end function



end function


parallel for


Parallel Matrix-Vector product



end function



end function

▶ Why is RowMult not usingparallel for?

Parallel Matrix-Vector product



end function



end function

▶ Why is RowMult not usingparallel for?

Divide and Conquer Matrix-Vector product

function MVDC(𝐴, 𝑥, 𝑦, 𝑓, 𝑡)if 𝑓 == 𝑡 then

RowMult(A,x,y,f)else

𝑚 = ⌊(𝑓 + 𝑡)/2⌋spawn MVDC(𝐴, 𝑥, 𝑦, 𝑓, 𝑚)MVDC(𝐴, 𝑥, 𝑦, 𝑚 + 1, 𝑡)sync

end ifend function

1,1 2,2 3,3 4,4 5,5 6,6 7,7 8,8

1,2 3,4 5,6 7,8

1,4 5,8

1,8





end ifend function

▶ 𝑇∞(𝑛) = Θ(log 𝑛)(binary tree)

▶ Θ(𝑛) leaves (one perrow)

▶ Θ(𝑛) interior nodes(binary tree)

▶ 𝑇1(𝑛) = Θ(𝑛2)





end ifend function




▶ 𝑇1(𝑛) = Θ(𝑛2)





end ifend function




▶ 𝑇1(𝑛) = Θ(𝑛2)





end ifend function




▶ 𝑇1(𝑛) = Θ(𝑛2)

Contents


SchedulingScheduling Problem

Abstractly Mapping threads to processorsPragmatically Mapping logical threads to a thread pool.

Ideal SchedulerOn-Line No advance knowledge of when threads will spawn or

complete.Distributed No central controller.

▶ to simplify analysis, we relax the second condition











A greedy centralized schedulerMaintain a ready queue of strands ready to run.

Scheduling Step

Complete Step If ≥ 𝑝 (# processors) strands are ready, assign 𝑝strands to processors.

Incomplete Step Otherwise, assign all waiting strands to processors

▶ To simplify analysis, split any non-unit strands into a chain ofunit strands

▶ Therefore, after one time step, we schedule again.


Scheduling Step






Scheduling Step






Scheduling Step





Optimal and Approximate SchedulingRecall

𝑇𝑝 ≥ 𝑇1/𝑝 (work law)𝑇𝑝 ≥ 𝑇∞ (span)

Therefore

𝑇𝑝 ≥ max(𝑇1/𝑝, 𝑇∞) = opt

With the greedy algorithm we can achieve

𝑇𝑝 ≤ 𝑇1𝑝

+ 𝑇∞ ≤ 2 max(𝑇1/𝑝, 𝑇∞) = 2 × opt

Optimal and Approximate SchedulingRecall

𝑇𝑝 ≥ 𝑇1/𝑝 (work law)𝑇𝑝 ≥ 𝑇∞ (span)

Therefore

𝑇𝑝 ≥ max(𝑇1/𝑝, 𝑇∞) = opt

With the greedy algorithm we can achieve


+ 𝑇∞ ≤ 2 max(𝑇1/𝑝, 𝑇∞) = 2 × opt

Counting Complete Steps

▶ Let 𝑘 be the number of complete steps.

▶ At each complete step we do 𝑝 units of work.▶ Every unit of work corresponds to one step of the serialization,

so 𝑘𝑝 ≤ 𝑇1.▶ Therefore 𝑘 ≤ 𝑇1/𝑝


▶ Let 𝑘 be the number of complete steps.▶ At each complete step we do 𝑝 units of work.

▶ Every unit of work corresponds to one step of the serialization,so 𝑘𝑝 ≤ 𝑇1.

▶ Therefore 𝑘 ≤ 𝑇1/𝑝


▶ Let 𝑘 be the number of complete steps.▶ At each complete step we do 𝑝 units of work.▶ Every unit of work corresponds to one step of the serialization,

so 𝑘𝑝 ≤ 𝑇1.

▶ Therefore 𝑘 ≤ 𝑇1/𝑝


▶ Let 𝑘 be the number of complete steps.▶ At each complete step we do 𝑝 units of work.▶ Every unit of work corresponds to one step of the serialization,

so 𝑘𝑝 ≤ 𝑇1.▶ Therefore 𝑘 ≤ 𝑇1/𝑝

Counting Incomplete Steps▶ Let 𝐺 be the DAG of remaining

strands.

▶ The ready queue of strands isexactly the set of sources in 𝐺

▶ In incomplete step runs all sourcesin 𝐺

▶ Every longest path starts at asource (otherwise, extend)

▶ After an incomplete step, length oflongest path shrinks by 1

▶ There can be at most 𝑇∞ steps.


strands.▶ The ready queue of strands is

exactly the set of sources in 𝐺

▶ In incomplete step runs all sourcesin 𝐺






exactly the set of sources in 𝐺▶ In incomplete step runs all sources

in 𝐺







in 𝐺▶ Every longest path starts at a

source (otherwise, extend)







source (otherwise, extend)▶ After an incomplete step, length of

longest path shrinks by 1






source (otherwise, extend)▶ After an incomplete step, length of

longest path shrinks by 1▶ There can be at most 𝑇∞ steps.

Parallel Slackness

parallel slackness = parallelism𝑝

= 𝑇1𝑝𝑇∞

speedup = 𝑇1𝑇𝑝

≤ 𝑇1𝑇∞

= 𝑝 × slackness

▶ If slackness < 1, speedup < 𝑝

▶ If slackness ≥ 1, linear speedup achievable for given number ofprocessors

Parallel Slackness

parallel slackness = parallelism𝑝

= 𝑇1𝑝𝑇∞

speedup = 𝑇1𝑇𝑝

≤ 𝑇1𝑇∞

= 𝑝 × slackness

▶ If slackness < 1, speedup < 𝑝▶ If slackness ≥ 1, linear speedup achievable for given number of

processors

Slackness and Schedulingslackness ∶= 𝑇1

𝑝 × 𝑇∞

TheoremFor sufficiently large slackness,the greed scheduler approachestime 𝑇1/𝑝.

Suppose

𝑇1𝑝 × 𝑇∞

≥ 𝑐

Then𝑇∞ ≤ 𝑇1

𝑐𝑝(2)

Recall that with the greedyscheduler,

𝑇𝑝 ≤ (𝑇1𝑝

+ 𝑇∞)

Substituting (2), we have


(1 + 1𝑐)


𝑝 × 𝑇∞


Suppose

𝑇1𝑝 × 𝑇∞

≥ 𝑐


𝑐𝑝(2)



+ 𝑇∞)



(1 + 1𝑐)


𝑝 × 𝑇∞


Suppose

𝑇1𝑝 × 𝑇∞

≥ 𝑐


𝑐𝑝(2)



+ 𝑇∞)



(1 + 1𝑐)


𝑝 × 𝑇∞


Suppose

𝑇1𝑝 × 𝑇∞

≥ 𝑐


𝑐𝑝(2)



+ 𝑇∞)



(1 + 1𝑐)


𝑝 × 𝑇∞


Suppose

𝑇1𝑝 × 𝑇∞

≥ 𝑐


𝑐𝑝(2)



+ 𝑇∞)



(1 + 1𝑐)

Contents


Race ConditionsNon-Determinism

▶ result varies from run to run▶ sometimes OK (in certain randomized algorithms)▶ mostly a bug.

Example

x = 0parallel for i ← 1 to 2 do

x ← x + 1

▶ This is nondeterministic unless incrementing x is atomic

Race ConditionsNon-Determinism

▶ result varies from run to run▶ sometimes OK (in certain randomized algorithms)▶ mostly a bug.

Example

x = 0parallel for i ← 1 to 2 do

x ← x + 1

▶ This is nondeterministic unless incrementing x is atomic

Racy execution

𝑥 = 0

𝑟1 ← 𝑥𝑟2 ← 𝑥

incr 𝑟1incr 𝑟2

𝑥 ← 𝑟1𝑥 ← 𝑟2

print 𝑥

▶ all possible topological sorts arevalid execution orders

▶ In particular it’s not hard for bothloads to complete before eitherstore

▶ In practice there are varioussynchronization strategies (locks,etc…).

▶ Here we will insist that parallelstrands are independent

Racy execution

𝑥 = 0

𝑟1 ← 𝑥𝑟2 ← 𝑥


𝑥 ← 𝑟1𝑥 ← 𝑟2

print 𝑥





Racy execution

𝑥 = 0

𝑟1 ← 𝑥𝑟2 ← 𝑥


𝑥 ← 𝑟1𝑥 ← 𝑟2

print 𝑥





Racy execution

𝑥 = 0

𝑟1 ← 𝑥𝑟2 ← 𝑥


𝑥 ← 𝑟1𝑥 ← 𝑟2

print 𝑥





We can write bad code with spawn too

sum(i, j)if (i>j)

return;if (i==j)

x++;else

m=(i+j)/2;spawn sum(i,m);sum(m+1,j);sync;

▶ here we have the samenon-deterministic interleaving ofreading and writing 𝑥

▶ the style is a bit unnatural, inparticular we are not using thereturn value of spawn at all.

Being more functional helps

sum(i, j)if (i>j) return 0;if (i==j) return 1;

m ← (i+j)/2;

left ← spawn sum(i,m);right ← sum(m+1,j);sync;return left + right;

▶ each strand writes intodifferent variables

▶ sync is used as a barrier toserialize

Being more functional helps

sum(i, j)if (i>j) return 0;if (i==j) return 1;

m ← (i+j)/2;

left ← spawn sum(i,m);right ← sum(m+1,j);sync;return left + right;

▶ each strand writes intodifferent variables

▶ sync is used as a barrier toserialize

Single Writer races

x ← spawn foo(x)y ← foo(x)sync

▶ arguments to spawnedroutines are evaluated in theparent context

▶ but this isn’t enough to berace free.

▶ which value 𝑥 is passed tothe second call of ’foo’depends how long the firstone takes.

Single Writer races





Single Writer races





Documents

CS3383 Unit 4: dynamic multithreaded algorithms€¦ · IntroductiontoParallelAlgorithms DynamicMultithreading Alsoknownasthefork-joinmodel Sharedmemory,multicore Cormenet.al3rdedition,Chapter27