
Page 1: Distributed Algorithms  (22903)

Distributed Algorithms (22903)

Lecturer: Danny Hendler

Shared objects: linearizability, wait-freedom and simulations

Most of this presentation is based on the book “Distributed Computing” by Hagit Attiya & Jennifer Welch.

Some slides are based on presentations of Nir Shavit.

Page 2: Distributed Algorithms  (22903)


Back to shared memory: Shared Objects

[Figure: processes connected to a shared memory that holds objects]

Page 3: Distributed Algorithms  (22903)


Shared Objects (cont’d)

• Each object has a state
  – Usually given by a set of shared memory fields
• Objects may be implemented from simpler base objects.
• Each object supports a set of operations
  – The only way to manipulate its state
  – E.g., a shared counter supports the fetch&increment operation.

Page 4: Distributed Algorithms  (22903)


Shared Objects Correctness

Correctness of a sequential counter

• fetch&increment, applied to a counter with value v, returns v and increments the counter’s value to (v+1).

• Values returned by consecutive operations: 0, 1, 2, …

But how do we define the correctness of a shared counter?
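The question above concerns concurrent use; the sequential behavior itself is easy to pin down. A minimal Python sketch (mine, not from the lecture) of a counter supporting only fetch&increment:

class SequentialCounter:
    # A counter in isolation: fetch&increment returns the current
    # value v and advances the counter to v+1.
    def __init__(self):
        self._value = 0

    def fetch_and_increment(self):
        v = self._value
        self._value = v + 1
        return v

c = SequentialCounter()
assert [c.fetch_and_increment() for _ in range(4)] == [0, 1, 2, 3]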

Page 5: Distributed Algorithms  (22903)


Shared Objects Correctness (cont’d)

There is only a partial order between operations!

[Figure: two timelines of overlapping operations, one with q.enq(x), q.enq(y), q.deq(x), q.deq(y) and one with four fetch&inc operations; each operation is an interval from its invocation to its response.]

Page 6: Distributed Algorithms  (22903)


Shared Objects Correctness (cont’d)

An invocation calls an operation on an object.

[Figure: the invocation c.f&i(), annotated with its parts: object (c), method (f&i), arguments]

Page 7: Distributed Algorithms  (22903)


Shared Objects Correctness (cont’d)

An object returns the response of the operation.

[Figure: the response c: 12, annotated with its parts: object (c), response (12)]

Page 8: Distributed Algorithms  (22903)


Shared Objects Correctness (cont’d)

A sequential object history is a sequence of matching invocations and responses on the object.

Example: a sequential history of a queue

q.enq(3)  q:void
q.enq(7)  q:void
q.deq()   q:3
q.deq()   q:7

Page 9: Distributed Algorithms  (22903)


Shared Objects Correctness (cont’d)

Sequential specification

The correct behavior of the object in the absence of concurrency. A set of legal sequential object histories.

Example: the sequential spec of a counter

H0:
H1: c.f&i() c:0
H2: c.f&i() c:0, c.f&i() c:1
H3: c.f&i() c:0, c.f&i() c:1, c.f&i() c:2
H4: c.f&i() c:0, c.f&i() c:1, c.f&i() c:2, c.f&i() c:3
…

Page 10: Distributed Algorithms  (22903)


Shared Objects Correctness (cont’d)

Linearizability

An execution is linearizable if, for each object o, there exists a permutation σ of the operations on o such that:

• σ is a sequential history of o

• σ preserves the partial order of the execution
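The definition suggests a brute-force test: enumerate permutations of the observed operations, keep those that respect real-time precedence, and check each against the sequential specification. A Python sketch (mine, not from the lecture; the tuple encoding and the FIFO-queue spec are my assumptions); it is exponential, so only suitable for tiny histories:

from itertools import permutations

# Each operation is (inv_time, resp_time, name, argument, result).
# Operation a precedes b in the execution iff a's response time
# is smaller than b's invocation time.

def respects_real_time(order):
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if b[1] < a[0]:      # b finished before a began, yet a is first
                return False
    return True

def legal_queue_history(order):
    # Replay the operations against a sequential FIFO queue.
    q = []
    for _, _, name, arg, result in order:
        if name == "enq":
            q.append(arg)
        elif name == "deq":
            if not q or q.pop(0) != result:
                return False
    return True

def linearizable(ops):
    return any(respects_real_time(p) and legal_queue_history(p)
               for p in permutations(ops))

# enq(x) overlaps enq(y); the dequeues come later (cf. the next slide).
ops = [(0, 3, "enq", "x", None), (1, 2, "enq", "y", None),
       (4, 5, "deq", None, "x"), (6, 7, "deq", None, "y")]
print(linearizable(ops))  # True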

Page 11: Distributed Algorithms  (22903)


Example

[Figure: q.enq(x) and q.enq(y) overlap, followed by q.deq(x) and q.deq(y). The execution is linearizable; one valid order is q.enq(x), q.enq(y), q.deq(x), q.deq(y).]

Page 12: Distributed Algorithms  (22903)


Example

[Figure: an execution with q.enq(x), q.enq(y), and q.deq(y). This execution is not linearizable.]

Page 13: Distributed Algorithms  (22903)


Example

[Figure: overlapping q.enq(x) and q.deq(x) operations. This execution is linearizable.]

Page 14: Distributed Algorithms  (22903)


Example

[Figure: an execution with q.enq(x), q.enq(y), q.deq(y), and q.deq(x). It is linearizable, and more than one linearization order is valid: multiple orders are OK.]

Page 15: Distributed Algorithms  (22903)


Wait-freedom

An algorithm is wait-free if every operation terminates after performing a finite number of events.

Wait-freedom implies that there is no use of locks (no mutual exclusion).

Thus the problems inherent to locks are avoided:

• Deadlock

• Priority inversion

Page 16: Distributed Algorithms  (22903)


Wait-free linearizable implementations

Example: the sequential spec of a register

H0:
H1: r.read() r:init
H2: r.write(v1) r:ack
H3: r.write(v1) r:ack, r.read() r:v1, r.read() r:v1
H4: r.write(v1) r:ack, r.write(v2) r:ack, r.read() r:v2
…

A Read returns the value written by the last preceding Write (or the initial value if there is no preceding Write).

Page 17: Distributed Algorithms  (22903)


Wait-free (linearizable) register simulations

Binary single-reader/single-writer register
↓
(Multi-valued) single-reader/single-writer register
↓
Multi-reader/single-writer register
↓
Multi-reader/multi-writer register

Page 18: Distributed Algorithms  (22903)


A wait-free (linearizable) implementation of a single-writer-single-reader (SRSW) multi-valued register from binary SRSW registers

Initially B[j] = 0 for all j ≠ i and B[i] = 1, where i is the initial value of R.

Read(R)
    Return the index of the single entry of B that equals 1.

Write(R, v)
    Write 1 to B[v], then clear the entry corresponding to the previous value (if other than v).

Would the above implementation of a k-valued register (initialized to i) work? No!
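To see why, it helps to write the broken algorithm out. A Python sketch (mine, not from the lecture; a plain list stands in for the k binary SRSW registers, and the bug only surfaces when a read interleaves with the individual steps of a write, as the next slide shows):

class BrokenMultiValuedRegister:
    def __init__(self, k, init):
        self.B = [0] * k
        self.B[init] = 1
        self._prev = init     # writer-local: the previously written value

    def read(self):
        # Return the index of the (supposedly single) entry equal to 1.
        i = 0
        while self.B[i] == 0:
            i += 1
        return i

    def write(self, v):
        self.B[v] = 1                  # set the new entry first...
        if self._prev != v:
            self.B[self._prev] = 0     # ...then clear the old one
        self._prev = v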

Page 19: Distributed Algorithms  (22903)


An example of a non-linearizable execution

Initially B[0] = B[1] = B[2] = 0 and B[3] = 1.

1. The reader starts a Read: it reads B[0] = 0, then B[1] = 0.
2. Write(1) runs: the writer writes 1 to B[1], then writes 0 to B[3], and returns ack.
3. Write(2) starts: the writer writes 1 to B[2].
4. The first Read reads B[2] = 1 and returns 2.
5. A second Read starts: it reads B[0] = 0, then B[1] = 1.
6. Write(2) finishes: the writer writes 0 to B[1] and returns ack.
7. The second Read returns 1.

Write(1) precedes Write(2) AND Read(2) precedes Read(1).

This is not linearizable!

Page 20: Distributed Algorithms  (22903)


A Wait-free Linearizable Implementation

Initially B[v] = 1 and all other entries equal 0, where v is the initial value of R.

Read(R)
1. i := 0
2. while B[i] = 0 do i := i+1
3. up := i; v := i
4. for i := up-1 downto 0 do
5.     if B[i] = 1 then v := i
6. return v

Write(R, v)
1. B[v] := 1
2. for i := v-1 downto 0 do B[i] := 0
3. return ack
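A Python sketch of this construction (mine, not from the lecture; again a plain list stands in for the binary SRSW registers, so each list access models one base-register operation):

class MultiValuedSRSWRegister:
    def __init__(self, k, init):
        self.B = [0] * k
        self.B[init] = 1

    def read(self):
        i = 0
        while self.B[i] == 0:              # scan upward to the first 1
            i += 1
        up, v = i, i
        for i in range(up - 1, -1, -1):    # scan back down,
            if self.B[i] == 1:             # remembering the lowest 1 seen
                v = i
        return v

    def write(self, v):
        self.B[v] = 1                      # set the new value,
        for i in range(v - 1, -1, -1):     # then clear all lower entries
            self.B[i] = 0
        return "ack"

The downward scan is exactly what the broken version lacked: a reader that was overtaken while scanning up catches the newer, lower-indexed value on the way back down.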

Page 21: Distributed Algorithms  (22903)


The linearization order

[Figure: the writer performs Write1(R,1), Write2(R,4), Write3(R,3), Write4(R,1); the reader performs Read1(R,init), Read2(R,4), Read3(R,4), Read4(R,3), Read5(R,1); the resulting linearization is shown.]

• Writes are linearized first.
• All reads from a specific write are linearized after it, in their real-time order.

Page 22: Distributed Algorithms  (22903)


Correctness proof for the SRSW multi-valued register simulation

Page 23: Distributed Algorithms  (22903)


Illustration for Lemma 1

[Figure: the array B; entry v holds 1, written by W, entry u holds 0, and value v1 is written by W1.]

Page 24: Distributed Algorithms  (22903)


Illustration for Lemma 1

[Figure: the array B; entry v holds 1, written by W, entry u holds 0, and values v1 and v2 are written by W1 and W2.]

Page 25: Distributed Algorithms  (22903)


Illustration for Lemma 2

Case 1: v’ ≤ v

[Figure: in execution E, Write W(v) precedes Read R, which returns v’; in the linearization π, W(v) is followed by W’(v’) and then R(v’). Entry v holds 1, written by W; entry v’ holds 1, written by W’; the entries between them hold 0.]

Page 26: Distributed Algorithms  (22903)


Illustration for Lemma 2 (cont’d)

Case 2: v’ > v

[Figure: in execution E, Write W(v) precedes Read R, which returns v’; entry v’ holds 1, written by W’(v’); entry v holds 1, written by W; an entry in between holds 0, written by W’’(x).]

From Lemma 1, R returns a value written by an operation that does not precede W!

Page 27: Distributed Algorithms  (22903)


Illustration for Lemma 3

Case 1: v1 = v2

[Figure: in execution E, Read R1 precedes Read R2, but π orders R2 before R1; Writes W1(v1) and W2(v2) both write 1 to the same entry v1 = v2.]

Page 28: Distributed Algorithms  (22903)


Illustration for Lemma 3 (cont’d)

Case 2: v1 > v2

[Figure: R1 precedes R2 in E, but π orders R2 before R1; entry v1 holds 1, written by W1, and entry v2 holds 1, written by W2.]

Since R1 precedes R2 and R2 reads from W2, R1 must see 1 in v2 when scanning down.

Page 29: Distributed Algorithms  (22903)


Illustration for Lemma 3 (cont’d)

Case 3: v1 < v2

[Figure: entry v2 holds 1, written by W2; entry v1 holds 1, written by W1; an entry holds 0, written by W3(v3).]

From Lemma 1, R2 returns a value written by an operation no sooner than W3!

Page 30: Distributed Algorithms  (22903)


A wait-free Implementation of a (multi-valued) multi-reader register from (multi-valued) SRSW registers

Page 31: Distributed Algorithms  (22903)


Would this work?

SRSW register Val[i]: the value written by the writer for reader pi.

Read(R) ; performed by pi
1. return Val[i]

Write(R, v)
1. for i := 0 to n-1 do Val[i] := v
2. return ack

Is the algorithm wait-free? Yes.
Is the algorithm linearizable? Nope.
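For reference, the flawed algorithm as a Python sketch (mine, not from the lecture; a plain list stands in for the n SRSW registers):

class BrokenMultiReaderRegister:
    def __init__(self, n, init):
        self.Val = [init] * n     # Val[i]: SRSW register, writer -> reader i

    def read(self, i):
        return self.Val[i]        # reader i just returns its own copy

    def write(self, v):
        for i in range(len(self.Val)):
            self.Val[i] = v       # update the copies one at a time
        return "ack"

It is wait-free (no step waits on another process), but a read that lands between two iterations of the write loop breaks linearizability, as the next slide shows.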

Page 32: Distributed Algorithms  (22903)


An example of a non-linearizable execution

Initially Val[0] = Val[1] = 0.

1. Pw starts Write(1): it writes 1 to Val[0].
2. P0 performs a Read: it reads Val[0] = 1 and returns 1.
3. P1 performs a Read: it reads Val[1] = 0 and returns 0.
4. Pw writes 1 to Val[1] and returns ack.

Read(1) precedes Read(0).

This is not linearizable!

Page 33: Distributed Algorithms  (22903)


A proof that no such simulation is possible, unless some readers… write!

Page 34: Distributed Algorithms  (22903)


A wait-free Implementation of a (multi-valued) multi-reader register from (multi-valued) SRSW registers

Data structures used

• Values are pairs of the form <val, sequence-number>.

• Sequence numbers are ever increasing.

Val[i]: The value written by pw for reader pi, for 1 ≤ i ≤ n

Report[i,j]: The value returned by the most recent read operation performed by pi; written by pi and read by pj, 1 ≤ i,j ≤ n.

Page 35: Distributed Algorithms  (22903)


A wait-free Implementation of a multi-reader register from SRSW registers (cont’d)

Initially Report[i,j] = Val[i] = (v0, 0), where v0 is R’s initial value.

Read(R) ; performed by process pr
1. (v[0], s[0]) := Val[r] ; most recent value written by the writer
2. for i := 1 to n do (v[i], s[i]) := Report[i,r] ; most recent value reported to pr by reader pi
3. let j be such that s[j] = max{s[0], s[1], …, s[n]}
4. for i := 1 to n do Report[r,i] := (v[j], s[j]) ; pr reports to all readers
5. return v[j]

Write(R, v) ; performed by the single writer
1. seq := seq+1
2. for i := 1 to n do Val[i] := (v, seq)
3. return ack
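A Python sketch of the construction (mine, not from the lecture; indices run 0..n-1 instead of 1..n, and plain list cells stand in for the SRSW registers):

class MultiReaderRegister:
    def __init__(self, n, init):
        self.n = n
        self.seq = 0                                        # writer-local
        self.Val = [(init, 0)] * n                          # writer -> reader i
        self.Report = [[(init, 0)] * n for _ in range(n)]   # reader i -> reader j

    def write(self, v):
        self.seq += 1
        for i in range(self.n):
            self.Val[i] = (v, self.seq)

    def read(self, r):
        # Collect the writer's pair for r and every reader's last report to r.
        pairs = [self.Val[r]] + [self.Report[i][r] for i in range(self.n)]
        v, s = max(pairs, key=lambda p: p[1])   # pair with the largest seq
        for i in range(self.n):
            self.Report[r][i] = (v, s)          # report before returning
        return v

The report step is the "readers must write" requirement from the previous slides: returning v without announcing it would let a later read by another process return an older value.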

Page 36: Distributed Algorithms  (22903)


The linearization order

[Figure: an execution with writes Write(v1,1), Write(v2,2), Write(v3,3), Write(v4,4) and reads Read1(init,0), Read2(v1,1), Read3(v2,2), Read4(v2,2), Read5(v4,4); the resulting linearization is shown.]

• Writes are linearized first.
• Reads are considered in increasing order of their responses, and each is placed after the write with the same sequence number.

Page 37: Distributed Algorithms  (22903)


A wait-free Implementation of a multi-reader-multi-writer register from multi-reader-single-writer registers

Page 38: Distributed Algorithms  (22903)


A wait-free Implementation of a MRMW register from MRSW registers.

Data structures used

• Values are pairs of the form <val, vector timestamp>.

• Timestamps are ever increasing.

TS[i]: The vector timestamp of writer pi, for 0 ≤ i ≤ m-1. Written by pi and read by all writers.

Val[i]: The latest value written by writer pi, for 0 ≤ i ≤ m-1, together with the vector timestamp associated with that value. Written by pi and read by all n readers.

Page 39: Distributed Algorithms  (22903)


Concurrent timestamps

• Provide a total order for write operations

• The total order respects the partial order of write operations

• Timestamps are implemented as vectors

• Ordered by lexicographic order

• Each writer increments its vector entry
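Python tuples compare lexicographically, so they model these vector timestamps directly; a small sketch (mine, not from the lecture):

# Vector timestamps, ordered lexicographically (entry by entry from the left).
assert (0, 0, 0) < (1, 0, 0) < (1, 1, 0) < (1, 1, 1) < (1, 2, 1)

# A writer w obtains a new timestamp by copying the latest entries
# and incrementing its own (cf. procedure NewCTS on a later slide):
def new_cts(latest, w):
    ts = list(latest)
    ts[w] += 1
    return tuple(ts)

assert new_cts((1, 1, 0), 2) == (1, 1, 1)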

Page 40: Distributed Algorithms  (22903)


Concurrent timestamps example

[Figure: three writers with TS[1] = TS[2] = TS[3] = <0,0,0>. Writer 1 assembles the new timestamp <1,0,0>. Order so far: <0,0,0>.]

Page 41: Distributed Algorithms  (22903)


Concurrent timestamps example

[Figure: TS[1] = <1,0,0>, TS[2] = TS[3] = <0,0,0>. Writer 2 assembles the new timestamp <1,1,0>. Order so far: <0,0,0> < <1,0,0>.]

Page 42: Distributed Algorithms  (22903)


Concurrent timestamps example

[Figure: TS[1] = <1,0,0>, TS[2] = <1,1,0>, TS[3] = <0,0,0>. Writer 3 assembles <1,1,1>, then writer 2 assembles <1,2,1>. Order so far: <0,0,0> < <1,0,0> < <1,1,0> < <1,1,1> < <1,2,1>.]

Page 43: Distributed Algorithms  (22903)


A wait-free Implementation of a MRMW register from MRSW registers.

Initially TS[i] = <0,0,…,0> and Val[i] equals the initial value of R.

Read(R) ; performed by reader pr
1. for i := 0 to m-1 do (v[i], t[i]) := Val[i] ; v and t are local
2. let j be such that t[j] = max{t[0], …, t[m-1]} ; lexicographic max
3. return v[j]

Write(R, v) ; performed by writer pw
1. ts := NewCTS() ; writer pw obtains a new vector timestamp
2. Val[w] := (v, ts)
3. return ack

Procedure NewCTS() ; called by writer pw
1. for i := 0 to m-1 do
2.     lts[i] := TS[i].i ; extract the i’th entry from the TS of the i’th writer
3. lts[w] := lts[w]+1 ; increment own entry
4. TS[w] := lts ; write pw’s new timestamp
5. return lts
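A Python sketch of the whole construction (mine, not from the lecture; tuples stand in for the MRSW registers, and ties between equal vectors, which a full treatment must break consistently, e.g. by writer id, are resolved arbitrarily by max here):

class MRMWRegister:
    def __init__(self, m, init):
        self.m = m
        self.TS = [(0,) * m for _ in range(m)]           # TS[i]: writer i's timestamp
        self.Val = [(init, (0,) * m) for _ in range(m)]  # Val[i]: (value, timestamp)

    def _new_cts(self, w):
        # Entry i comes from the i'th entry of writer i's timestamp.
        lts = [self.TS[i][i] for i in range(self.m)]
        lts[w] += 1                  # increment own entry
        ts = tuple(lts)
        self.TS[w] = ts
        return ts

    def write(self, w, v):
        self.Val[w] = (v, self._new_cts(w))
        return "ack"

    def read(self):
        # Return the value carrying the lexicographically largest timestamp.
        v, t = max(self.Val, key=lambda p: p[1])
        return v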

Page 44: Distributed Algorithms  (22903)


The linearization order

[Figure: Writer 1 performs Write(v1,<1,0>) and Write(v4,<2,2>); Writer 2 performs Write(v2,<1,1>) and Write(v3,<1,2>); Readers 1 and 2 perform Read1(init,<0,0>), Read2(init,<0,0>), Read3(v2,<1,1>), Read4(v2,<1,1>), Read5(v4,<2,2>); the resulting linearization is shown.]

• Writes are linearized first, in timestamp order.
• Reads are considered in increasing order of their responses, and each is placed after the write with the same timestamp.