Nested Parallelism in Transactional Memory Kunal Agrawal, Jeremy T. Fineman and Jim Sukha MIT


Page 1: Nested Parallelism in Transactional Memory

Nested Parallelism in Transactional Memory

Kunal Agrawal, Jeremy T. Fineman and Jim Sukha

MIT

Page 2: Nested Parallelism in Transactional Memory

Program Representation

ParallelIncrement() {
  parallel
    { x ← x+1 }   // Thread1
    { x ← x+1 }   // Thread2
}

The parallel keyword allows the two following code blocks (enclosed in { }) to execute in parallel.

Page 3: Nested Parallelism in Transactional Memory

Program Representation

[Figure: computation tree — root S0; child P1; P1's children S1 and S2; each S node contains leaves R x and W x.]

ParallelIncrement() {
  parallel
    { x ← x+1 }   // Thread1
    { x ← x+1 }   // Thread2
}

The parallel keyword allows the two following code blocks (enclosed in { }) to execute in parallel.

• We model the execution of a multithreaded program as a walk of a series-parallel computation tree.

Page 4: Nested Parallelism in Transactional Memory

Program Representation

[Figure: computation tree — root S0; child P1; P1's children S1 and S2; leaves u1 (R x), u2 (W x) under S1 and u3 (R x), u4 (W x) under S2.]

ParallelIncrement() {
  parallel         // P1
    { x ← x+1 }    // S1
    { x ← x+1 }    // S2
}

The parallel keyword allows the two following code blocks (enclosed in { }) to execute in parallel.

• We model the execution of a multithreaded program as a walk of a series-parallel computation tree.

• Internal nodes of the tree are S (series) or P (parallel) nodes. The leaves of the tree are memory operations.


Page 5: Nested Parallelism in Transactional Memory

Program Representation

[Figure: computation tree — root S0; child P1; P1's children S1 and S2; leaves u1 (R x), u2 (W x) under S1 and u3 (R x), u4 (W x) under S2.]

The parallel keyword allows the two following code blocks (enclosed in { }) to execute in parallel.

• We model the execution of a multithreaded program as a walk of a series-parallel computation tree.

• Internal nodes of the tree are S (series) or P (parallel) nodes. The leaves of the tree are memory operations.

• All child subtrees of an S node must execute in series in left-to-right order. The child subtrees of a P node can potentially execute in parallel.

ParallelIncrement() {
  parallel         // P1
    { x ← x+1 }    // S1
    { x ← x+1 }    // S2
}


Page 6: Nested Parallelism in Transactional Memory

Data Races

• Two (or more) parallel accesses to the same memory location (where one of the accesses is a write) constitute a data race. (In the tree, two accesses can happen in parallel if their least common ancestor is a P node.)

[Figure: computation tree — root S0; child P1; P1's children S1 and S2; leaves u1 (R x), u2 (W x) under S1 and u3 (R x), u4 (W x) under S2.]

ParallelIncrement() {
  parallel         // P1
    { x ← x+1 }    // S1
    { x ← x+1 }    // S2
}


There are races between u1 and u4, u3 and u2, and u2 and u4.
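The LCA rule above can be sketched in executable form. The following is an illustrative Python sketch (the Node and Access classes are hypothetical, not the paper's data structures): two accesses to the same location race iff at least one is a write and their least common ancestor is a P node.

```python
# Hypothetical sketch of the LCA-based race rule, not the paper's implementation.

class Node:
    def __init__(self, kind, parent=None):
        self.kind = kind          # "S" (series) or "P" (parallel)
        self.parent = parent

class Access:
    def __init__(self, loc, is_write, node):
        self.loc, self.is_write, self.node = loc, is_write, node

def ancestors(node):
    """Return the path from node up to the root (inclusive)."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def lca(a, b):
    """Least common ancestor of two tree nodes."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n
    raise ValueError("nodes are not in the same tree")

def races(acc1, acc2):
    """Two accesses to the same location race iff at least one is a
    write and their LCA is a P node (so they may run in parallel)."""
    if acc1.loc != acc2.loc:
        return False
    if not (acc1.is_write or acc2.is_write):
        return False
    return lca(acc1.node, acc2.node).kind == "P"

# The tree from the slides: S0 -> P1 -> (S1, S2), with
# u1 = R x, u2 = W x under S1 and u3 = R x, u4 = W x under S2.
S0 = Node("S")
P1 = Node("P", S0)
S1 = Node("S", P1)
S2 = Node("S", P1)
u1 = Access("x", False, S1)
u2 = Access("x", True,  S1)
u3 = Access("x", False, S2)
u4 = Access("x", True,  S2)

assert races(u1, u4) and races(u2, u3) and races(u2, u4)
assert not races(u1, u2)   # LCA is S1, a series node
assert not races(u1, u3)   # two reads never race
```

The three asserted races match the slide's list: u1–u4, u2–u3, and u2–u4.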

Page 7: Nested Parallelism in Transactional Memory

Data Races

• Two (or more) parallel accesses to the same memory location (where one of the accesses is a write) constitute a data race. (In the tree, two accesses can happen in parallel if their least common ancestor is a P node.)

• Data races lead to nondeterministic program behavior.

• Traditionally, locks are used to prevent data races.

[Figure: computation tree — root S0; child P1; P1's children S1 and S2; leaves u1 (R x), u2 (W x) under S1 and u3 (R x), u4 (W x) under S2.]

ParallelIncrement() {
  parallel         // P1
    { x ← x+1 }    // S1
    { x ← x+1 }    // S2
}


There are races between u1 and u4, u3 and u2, and u2 and u4.

Page 8: Nested Parallelism in Transactional Memory

Transactional Memory

[Figure: computation tree — root S0; P1 with children S1 and S2; transaction A under S1 contains u1 (R x), u2 (W x); transaction B under S2 contains u3 (R x), u4 (W x).]

ParallelIncrement() {
  parallel                   // P1
    { atomic { x ← x+1 }     // A
    }                        // S1
    { atomic { x ← x+1 }     // B
    }                        // S2
}


• Transactional memory has been proposed as an alternative to locks.

• The programmer simply encloses the critical region in an atomic block. The runtime system ensures that the region executes atomically by tracking its reads and writes, detecting conflicts, and aborting and retrying if necessary.

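The runtime loop described above (track reads and writes, detect conflicts, abort and retry) can be sketched as a toy STM. This is an illustrative sketch only — it validates at commit time rather than eagerly as CWSTM does, and the names (TM, Txn, atomic) are hypothetical, not the paper's:

```python
# Toy optimistic STM sketch: buffered writes, per-location versions,
# commit-time conflict detection, abort-and-retry on conflict.
# Illustrative only; not the paper's CWSTM design.
import threading

class Abort(Exception):
    pass

class TM:
    def __init__(self):
        self.mem = {}                  # committed values
        self.ver = {}                  # location -> version number
        self.lock = threading.Lock()

class Txn:
    def __init__(self, tm):
        self.tm, self.reads, self.writes = tm, {}, {}

    def read(self, loc):
        if loc in self.writes:
            return self.writes[loc]
        self.reads.setdefault(loc, self.tm.ver.get(loc, 0))  # record version seen
        return self.tm.mem.get(loc, 0)

    def write(self, loc, val):
        self.writes[loc] = val         # buffered until commit

def atomic(tm, body):
    """Execute body(txn) atomically, aborting and retrying on conflict."""
    while True:
        txn = Txn(tm)
        try:
            body(txn)
            with tm.lock:
                # conflict detection: did any location we read change?
                for loc, v in txn.reads.items():
                    if tm.ver.get(loc, 0) != v:
                        raise Abort()
                for loc, val in txn.writes.items():   # commit: publish writes
                    tm.mem[loc] = val
                    tm.ver[loc] = tm.ver.get(loc, 0) + 1
            return
        except Abort:
            continue                   # discard the write set and retry

# Four parallel atomic increments of x, as in ParallelIncrement.
tm = TM()
threads = [threading.Thread(
               target=lambda: atomic(tm, lambda t: t.write("x", t.read("x") + 1)))
           for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
assert tm.mem["x"] == 4    # no lost updates despite the races
```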

Page 9: Nested Parallelism in Transactional Memory

Nested Parallelism

One can generate more parallelism by nesting parallel blocks.

ParallelIncrement() {
  parallel           // P1
    { x ← x+1 }      // S1
    { x ← x+1
      parallel       // P2
        { x ← x+1 }  // S3
        { x ← x+1 }  // S4
    }                // S2
}

[Figure: computation tree — root S0; P1 with children S1 and S2; S1 holds u1 (R x), u2 (W x); S2 holds u3 (R x), u4 (W x) followed by P2, whose children S3 and S4 each hold an R x, W x pair (u5–u8).]

Page 10: Nested Parallelism in Transactional Memory

Nested Parallelism in Transactions

Use transactions to prevent data races. (Notice the parallelism inside transaction B.)

[Figure: computation tree — root S0; P1 with children S1 and S2; transaction A under S1 contains u1 (R x), u2 (W x); transaction B under S2 contains u3 (R x), u4 (W x) and P2, whose children S3 and S4 each contain an R x, W x pair (u5–u8).]

ParallelIncrement() {
  parallel                   // P1
    { atomic { x ← x+1 }     // A
    }                        // S1
    { atomic {
        x ← x+1
        parallel             // P2
          { x ← x+1 }        // S3
          { x ← x+1 }        // S4
      }                      // B
    }                        // S2
}


Page 11: Nested Parallelism in Transactional Memory

Nested Parallelism in Transactions

Use transactions to prevent data races. (Notice the parallelism inside transaction B.)

This program unfortunately has data races.

[Figure: computation tree — root S0; P1 with children S1 and S2; transaction A under S1 contains u1 (R x), u2 (W x); transaction B under S2 contains u3 (R x), u4 (W x) and P2, whose children S3 and S4 each contain an R x, W x pair (u5–u8).]

ParallelIncrement() {
  parallel                   // P1
    { atomic { x ← x+1 }     // A
    }                        // S1
    { atomic {
        x ← x+1
        parallel             // P2
          { x ← x+1 }        // S3
          { x ← x+1 }        // S4
      }                      // B
    }                        // S2
}


Page 12: Nested Parallelism in Transactional Memory

ParallelIncrement() {
  parallel
    { atomic { x ← x+1 }           // A
    }
    { atomic {
        x ← x+1
        parallel
          { atomic { x ← x+1 } }   // C
          { atomic { x ← x+1 } }   // D
      }                            // B
    }                              // S2
}

[Figure: computation tree — root S0; P1 with children S1 and S2; transaction A under S1; transaction B under S2 contains P2, whose children S3 and S4 hold nested transactions C and D; each transaction body is an R x, W x pair.]

Nested Parallelism and Nested Transactions

Add more transactions


Page 13: Nested Parallelism in Transactional Memory

ParallelIncrement() {
  parallel
    { atomic { x ← x+1 }           // A
    }
    { atomic {
        x ← x+1
        parallel
          { atomic { x ← x+1 } }   // C
          { atomic { x ← x+1 } }   // D
      }                            // B
    }                              // S2
}

[Figure: computation tree — root S0; P1 with children S1 and S2; transaction A under S1; transaction B under S2 contains P2, whose children S3 and S4 hold nested transactions C and D; each transaction body is an R x, W x pair.]

Nested Parallelism and Nested Transactions

Transactions C and D are nested inside transaction B. Therefore transaction B has both nested transactions and nested parallelism.


Page 14: Nested Parallelism in Transactional Memory

Our Contribution

• We describe CWSTM, a theoretical design for a software transactional memory system which allows nested parallelism in transactions for dynamic multithreaded languages which use a work-stealing scheduler.

• Our design efficiently supports nesting and parallelism of unbounded depth.

• CWSTM supports:
  – efficient eager conflict detection, and
  – eager updates (fast commits).

• We prove that CWSTM exhibits small overhead on a program with transactions compared to the same program with all atomic blocks removed.

Page 15: Nested Parallelism in Transactional Memory

More Precisely…

• A work-stealing scheduler guarantees that a transaction-less program with work T1 and critical path T∞ running on P processors completes in time O(T1/P + T∞).
  – Provides linear speedup when T1/T∞ >> P.

Page 16: Nested Parallelism in Transactional Memory

More Precisely…

• A work-stealing scheduler guarantees that a transaction-less program with work T1 and critical path T∞ running on P processors completes in time O(T1/P + T∞).
  – Provides linear speedup when T1/T∞ >> P.

• If a program has no aborts and no read contention*, then CWSTM completes the program with transactions in time O(T1/P + PT∞).
  – Provides linear speedup when T1/T∞ >> P².

Page 17: Nested Parallelism in Transactional Memory

More Precisely…

• A work-stealing scheduler guarantees that a transaction-less program with work T1 and critical path T∞ running on P processors completes in time O(T1/P + T∞).
  – Provides linear speedup when T1/T∞ >> P.

• If a program has no aborts and no read contention*, then CWSTM completes the program with transactions in time O(T1/P + PT∞).
  – Provides linear speedup when T1/T∞ >> P².

*In the presence of multiple readers, a write to a memory location has to check for conflicts against multiple readers.

Page 18: Nested Parallelism in Transactional Memory

Outline

• Introduction
• Semantics of TM
• Difficulty of Conflict Detection
• Access Stack
• Lazy Access Stack
• Intuition for Final Design Using Traces and Analysis
• Conclusions and Future Work

Page 19: Nested Parallelism in Transactional Memory

Conflicts in Transactions

• Transactional memory optimistically executes transactions and maintains the write set W(T) for each transaction T.

• Active transactions A and B conflict iff they are in parallel with each other and their write sets overlap.

parallel
  { atomic {
      x ← 1
      y ← 2
    }          // A
  }            // S1
  { atomic {
      z ← 3
      atomic {
        z ← 4
        x ← 5
      }        // C
    }          // B
  }            // S2

[Figure: computation tree — root S0; P1 with children S1 and S2; transaction A under S1 holds u1 (W x), u2 (W y); transaction B under S2 holds u3 (W z) and nested transaction C with u4 (W z), u5 (W x). Write sets: W(A)={}, W(B)={}, W(C)={}.]

Page 20: Nested Parallelism in Transactional Memory

Conflicts in Transactions

[Figure: same tree — A holds W x, W y; B holds W z and nested C with W z, W x. After u1 executes, W(A)={x}, W(B)={}, W(C)={}.]

• Transactional memory optimistically executes transactions and maintains the write set W(T) for each transaction T.

• Active transactions A and B conflict iff they are in parallel with each other and their write sets overlap.

parallel
  { atomic {
      x ← 1
      y ← 2
    }          // A
  }            // S1
  { atomic {
      z ← 3
      atomic {
        z ← 4
        x ← 5
      }        // C
    }          // B
  }            // S2


Page 21: Nested Parallelism in Transactional Memory

Conflicts in Transactions

[Figure: same tree — A holds W x, W y; B holds W z and nested C with W z, W x. After u3 executes, W(A)={x}, W(B)={z}, W(C)={}.]

• Transactional memory optimistically executes transactions and maintains the write set W(T) for each transaction T.

• Active transactions A and B conflict iff they are in parallel with each other and their write sets overlap.

parallel
  { atomic {
      x ← 1
      y ← 2
    }          // A
  }            // S1
  { atomic {
      z ← 3
      atomic {
        z ← 4
        x ← 5
      }        // C
    }          // B
  }            // S2


Page 22: Nested Parallelism in Transactional Memory

Conflicts in Transactions

[Figure: same tree — A holds W x, W y; B holds W z and nested C with W z, W x. After u4 executes, W(A)={x}, W(B)={z}, W(C)={z}.]

• Transactional memory optimistically executes transactions and maintains the write set W(T) for each transaction T.

• Active transactions A and B conflict iff they are in parallel with each other and their write sets overlap.

parallel
  { atomic {
      x ← 1
      y ← 2
    }          // A
  }            // S1
  { atomic {
      z ← 3
      atomic {
        z ← 4
        x ← 5
      }        // C
    }          // B
  }            // S2


Page 23: Nested Parallelism in Transactional Memory

Conflicts in Transactions

[Figure: same tree — at u5, W(A)={x}, W(B)={z}, W(C)={z, x}. C's write to x overlaps with parallel transaction A's write set: CONFLICT!]

• Transactional memory optimistically executes transactions and maintains the write set W(T) for each transaction T.

• Active transactions A and B conflict iff they are in parallel with each other and their write sets overlap.

parallel
  { atomic {
      x ← 1
      y ← 2
    }          // A
  }            // S1
  { atomic {
      z ← 3
      atomic {
        z ← 4
        x ← 5
      }        // C
    }          // B
  }            // S2

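The conflict rule on these slides can be sketched directly. In this illustrative Python sketch (Txn is a hypothetical class, not the paper's data structure), two active transactions conflict iff neither encloses the other — so they may run in parallel — and their write sets overlap:

```python
# Illustrative sketch of the write-set conflict rule.

class Txn:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.writeset = set()
        self.active = True

def is_ancestor(a, b):
    """True iff transaction a is b or an ancestor of b."""
    while b is not None:
        if b is a:
            return True
        b = b.parent
    return False

def conflicts(a, b):
    """Active transactions conflict iff neither encloses the other
    (so they may run in parallel) and their write sets overlap."""
    if not (a.active and b.active):
        return False
    if is_ancestor(a, b) or is_ancestor(b, a):
        return False
    return bool(a.writeset & b.writeset)

# Scenario from the slides: A in parallel with B; C nested inside B.
A = Txn("A")
B = Txn("B")
C = Txn("C", parent=B)
A.writeset = {"x"}          # A after u1: x ← 1
B.writeset = {"z"}          # B after u3: z ← 3
C.writeset = {"z", "x"}     # C after u5: z ← 4, x ← 5

assert not conflicts(B, C)  # C is nested inside B: no conflict
assert conflicts(A, C)      # both wrote x, and A runs in parallel with C
```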

Page 24: Nested Parallelism in Transactional Memory

Nested Transactions: Commit and Abort

• If two transactions conflict, one of them is aborted and its write set is discarded.

[Figure: computation tree — transaction A (W(A)={y}) in parallel with transaction B (W(B)={z, x}); inside B, parallel node P2 runs transactions C (W(C)={z, u}) and D.]

Page 25: Nested Parallelism in Transactional Memory

Nested Transactions: Commit and Abort

• If two transactions conflict, one of them is aborted and its write set is discarded.

[Figure: computation tree — transaction A (W(A)={y}) in parallel with transaction B (W(B)={z, x}); inside B, parallel node P2 runs transactions C (W(C)={z, u}) and D.]

Page 26: Nested Parallelism in Transactional Memory

Nested Transactions: Commit and Abort

• If two transactions conflict, one of them is aborted and its write set is discarded.

• If a transaction completes without a conflict, it is committed and its write set is merged with its parent transaction's write set.

[Figure: same tree — W(A)={y}, W(B)={z, x}, W(C)={z, u}; C is about to commit.]

Page 27: Nested Parallelism in Transactional Memory

Nested Transactions: Commit and Abort

• If two transactions conflict, one of them is aborted and its write set is discarded.

• If a transaction completes without a conflict, it is committed and its write set is merged with its parent transaction's write set.

[Figure: after C commits, its write set merges into B: W(A)={y}, W(B)={z, x, u}.]
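The commit and abort rules above can be sketched in a few lines (illustrative names; not the paper's data structures):

```python
# Illustrative sketch of commit/abort for nested transactions.

class Txn:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.writeset = set()
        self.active = True

def commit(txn):
    """Commit: merge the transaction's write set into its parent's."""
    txn.active = False
    if txn.parent is not None:
        txn.parent.writeset |= txn.writeset

def abort(txn):
    """Abort: discard the transaction's write set."""
    txn.active = False
    txn.writeset = set()

# The scenario from the slide: C (nested in B) commits.
B = Txn("B")
C = Txn("C", parent=B)
B.writeset = {"z", "x"}
C.writeset = {"z", "u"}
commit(C)
assert B.writeset == {"z", "x", "u"}   # matches the slide: W(B)={z, x, u}
```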

Page 28: Nested Parallelism in Transactional Memory

Outline

• Introduction
• Semantics of TM
• Difficulty of Conflict Detection
• Access Stack
• Lazy Access Stack
• Intuition for Final Design Using Traces and Analysis
• Conclusions and Future Work

Page 29: Nested Parallelism in Transactional Memory

Conflicts in Serial Transactions

• Virtually all proposed TM systems focus on the case where transactions are serial (no P nodes in subtrees of transactions).

• Two writes to the same memory location cause a conflict if and only if they are on different threads.

• TM system can just check to see if some other thread wrote to the memory location.

[Figure: computation tree — serial nested transactions A, B, C executed by Thread 1 and D, E, F executed by Thread 2, each writing x and/or z. Last-writer table: x → Thread 2, z → Thread 1.]

Page 30: Nested Parallelism in Transactional Memory

Conflicts in Serial Transactions

• Virtually all proposed TM systems focus on the case where transactions are serial (no P nodes in subtrees of transactions).

• Two writes to the same memory location cause a conflict if and only if they are on different threads.

• TM system can just check to see if some other thread wrote to the memory location.

[Figure: as above, now showing a conflict on location x: CONFLICT!]
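The serial-transaction check described above amounts to a per-location last-writer table keyed by thread id. A minimal illustrative sketch (not the paper's implementation):

```python
# Sketch of serial-transaction conflict detection via thread ids.

last_writer = {}   # location -> thread id whose active transaction wrote it

def write(thread_id, loc):
    """Record a transactional write; conflict iff another thread's
    active transaction already wrote this location."""
    owner = last_writer.get(loc)
    if owner is not None and owner != thread_id:
        return "conflict"
    last_writer[loc] = thread_id
    return "ok"

# The table from the slide: x last written by Thread 2, z by Thread 1.
assert write(2, "x") == "ok"
assert write(1, "z") == "ok"
assert write(1, "z") == "ok"        # same thread: no conflict
assert write(1, "x") == "conflict"  # Thread 2's transaction wrote x
```

This is exactly why thread ids suffice for serial transactions — and, as the next slide shows, why they stop sufficing under a work-stealing scheduler.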

Page 31: Nested Parallelism in Transactional Memory

Thread ID is not enough

• A work-stealing scheduler does not create a thread for every S-node; instead, it schedules a computation on a fixed number of worker threads.

• Runtime cannot simply compare worker ids to determine whether two transactions conflict.

[Figure: computation tree spread over 3 workers — transactions X0, Y1, Y2, Y3 and Z1–Z4; some subtrees inactive or unexecuted; W(Y1)={x,…} and W(Y3)={x,…}.]

Page 32: Nested Parallelism in Transactional Memory

Thread ID is not enough

[Figure: same tree; now W(Y1)={x,…}, W(Y2)={x,…}, and W(Y3)={x,…}.]

• A work-stealing scheduler does not create a thread for every S-node; instead, it schedules a computation on a fixed number of worker threads.

• Runtime cannot simply compare worker ids to determine whether two transactions conflict.

Page 33: Nested Parallelism in Transactional Memory

Outline

• Introduction
• Semantics of TM
• Difficulty of Conflict Detection
• Access Stack
• Lazy Access Stack
• Intuition for Final Design Using Traces and Analysis
• Conclusions and Future Work

Page 34: Nested Parallelism in Transactional Memory

CWSTM Invariant: Conflict-Free Execution

INVARIANT 1: At any time, for any given location L, all active transactions that have L in their writeset fall along a (root-to-leaf) chain.

[Figure: computation tree — the active transactions that accessed L (Y0, Y1, Y2, Y3) lie on a single root-to-leaf chain; Z1 and Z2 are active transactions elsewhere in the tree; other transactions are inactive.]

Page 35: Nested Parallelism in Transactional Memory

CWSTM Invariant: Conflict-Free Execution

INVARIANT 1: At any time, for any given location L, all active transactions that have L in their writeset fall along a (root-to-leaf) chain. Let X be the end of the chain.

[Figure: same tree; X marks the deepest transaction at the end of the chain Y0–Y3.]

Page 36: Nested Parallelism in Transactional Memory

CWSTM Invariant: Conflict-Free Execution

INVARIANT 2: If Z tries to access object L:
• No conflict if X is an ancestor of Z (e.g., Z1).
• Conflict if X is not an ancestor of Z (e.g., Z2).

INVARIANT 1: At any time, for any given location L, all active transactions that have L in their writeset fall along a (root-to-leaf) chain. Let X be the end of the chain.

[Figure: same tree; Z1 lies in X's subtree, so X is its ancestor; Z2 lies in a parallel subtree, so X is not its ancestor.]

Page 37: Nested Parallelism in Transactional Memory

Design Attempt 1

[Figure: computation tree with chain Y0, Y1, Y3 and transactions Z1, Z2. Access stack for L (bottom to top): Y0, Y1, …, Y3; Top = X.]

• For every L, keep an access stack for L, holding the chain of active transactions which have L in their writeset.

Page 38: Nested Parallelism in Transactional Memory

Design Attempt 1

[Figure: computation tree with chain Y0, Y1, Y3 and transactions Z1, Z2. Access stack for L (bottom to top): Y0, Y1, …, Y3; Top = X.]

• For every L, keep an access stack for L, holding the chain of active transactions which have L in their writeset.
  – Access stacks are changed on commits and aborts. If Y3 commits, it is replaced by Y2. If Y3 aborts, it disappears from the stack and Y1 is at the top.

Page 39: Nested Parallelism in Transactional Memory

Design Attempt 1

• For every L, keep an access stack for L, holding the chain of active transactions which have L in their writeset.
  – Access stacks are changed on commits and aborts. If Y3 commits, it is replaced by Y2. If Y3 aborts, it disappears from the stack and Y1 is at the top.
• Let X be the top of the access stack for L. When transaction Z tries to access L, report a conflict if and only if X is not an ancestor of Z.

[Figure: computation tree with chain Y0, Y1, Y3 and transactions Z1, Z2. Access stack for L (bottom to top): Y0, Y1, …, Y3; Top = X.]
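Design Attempt 1's check can be sketched as follows (illustrative classes and names, not the paper's implementation): each location keeps a stack of the active transactions that wrote it, and an access conflicts iff the stack's top is not an ancestor of the accessing transaction.

```python
# Sketch of per-location access stacks (Design Attempt 1).

class Txn:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

def is_ancestor(a, z):
    """True iff transaction a is z or an ancestor of z."""
    while z is not None:
        if z is a:
            return True
        z = z.parent
    return False

access_stack = {}   # location -> chain of transactions that wrote it

def on_write(txn, loc):
    stack = access_stack.setdefault(loc, [])
    if stack and not is_ancestor(stack[-1], txn):
        return "conflict"             # top X is not an ancestor of txn
    if not stack or stack[-1] is not txn:
        stack.append(txn)             # txn extends the chain for loc
    return "ok"

# A chain Y0 > Y1 > Y3 writing L, then Z2 in a parallel subtree.
Y0 = Txn("Y0")
Y1 = Txn("Y1", Y0)
Y3 = Txn("Y3", Y1)
Z2 = Txn("Z2", Y0)                    # parallel with Y1's subtree

assert on_write(Y0, "L") == "ok"
assert on_write(Y1, "L") == "ok"
assert on_write(Y3, "L") == "ok"        # stack holds only ancestors
assert on_write(Z2, "L") == "conflict"  # top Y3 is not Z2's ancestor
```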

Page 40: Nested Parallelism in Transactional Memory

Maintenance of access stack on commit.

• Consider a serial program with a chain of nested transactions, Y0, Y1, … Yd. Each Yi accesses a unique location Li.

[Figure: chain of nested transactions Y0 ⊃ Y1 ⊃ … ⊃ Yd, with W(Yi)={Li} for each i.]

Page 41: Nested Parallelism in Transactional Memory

Maintenance of access stack on commit.

• Consider a serial program with a chain of nested transactions, Y0, Y1, … Yd. Each Yi accesses a unique location Li.
  – Total work with no transactions: O(d).

[Figure: chain of nested transactions Y0 ⊃ Y1 ⊃ … ⊃ Yd, with W(Yi)={Li} for each i.]

Page 42: Nested Parallelism in Transactional Memory

Maintenance of access stack on commit.

• Consider a serial program with a chain of nested transactions, Y0, Y1, … Yd. Each Yi accesses a unique location Li.
  – Total work with no transactions: O(d).
• On commit of a transaction, the access stacks of all the memory locations in its write set must be updated.

[Figure: chain of nested transactions Y0 ⊃ Y1 ⊃ … ⊃ Yd, with W(Yi)={Li} for each i.]

Page 43: Nested Parallelism in Transactional Memory

Maintenance of access stack on commit.

• Consider a serial program with a chain of nested transactions, Y0, Y1, … Yd. Each Yi accesses a unique location Li.
  – Total work with no transactions: O(d).
• On commit of a transaction, the access stacks of all the memory locations in its write set must be updated.

[Figure: after Yd commits, W(Yd−1)={Ld−1, Ld} — O(1) stack updates; other write sets unchanged.]

Page 44: Nested Parallelism in Transactional Memory

Maintenance of access stack on commit.

[Figure: after Yd−1 commits, W(Yd−2)={Ld−2, Ld−1, Ld} — O(2) stack updates.]

• Consider a serial program with a chain of nested transactions, Y0, Y1, … Yd. Each Yi accesses a unique location Li.
  – Total work with no transactions: O(d).
• On commit of a transaction, the access stacks of all the memory locations in its write set must be updated.

Page 45: Nested Parallelism in Transactional Memory

Maintenance of access stack on commit.

[Figure: continuing up the chain, W(Y2)={L2, L3, …, Ld} and W(Y1)={L1, L2, …, Ld}; Y2's commit costs O(d−1) stack updates.]

• Consider a serial program with a chain of nested transactions, Y0, Y1, … Yd. Each Yi accesses a unique location Li.
  – Total work with no transactions: O(d).
• On commit of a transaction, the access stacks of all the memory locations in its write set must be updated.

Page 46: Nested Parallelism in Transactional Memory

Maintenance of access stack on commit.

[Figure: finally W(Y0)={L0, L1, …, Ld}; Y1's commit costs O(d) stack updates.]

• Consider a serial program with a chain of nested transactions, Y0, Y1, … Yd. Each Yi accesses a unique location Li.
  – Total work with no transactions: O(d).
• On commit of a transaction, the access stacks of all the memory locations in its write set must be updated.

Page 47: Nested Parallelism in Transactional Memory

Maintenance of access stack on commit.

[Figure: finally W(Y0)={L0, L1, …, Ld}; Y1's commit costs O(d) stack updates.]

• Consider a serial program with a chain of nested transactions, Y0, Y1, … Yd. Each Yi accesses a unique location Li.
  – Total work with no transactions: O(d).
• On commit of a transaction, the access stacks of all the memory locations in its write set must be updated.
• On commit of transaction Yi, (d−i+1) access stacks must be updated.

Page 48: Nested Parallelism in Transactional Memory

Maintenance of access stack on commit.

[Figure: finally W(Y0)={L0, L1, …, Ld}; Y1's commit costs O(d) stack updates.]

• Consider a serial program with a chain of nested transactions, Y0, Y1, … Yd. Each Yi accesses a unique location Li.
  – Total work with no transactions: O(d).
• On commit of a transaction, the access stacks of all the memory locations in its write set must be updated.
• On commit of transaction Yi, (d−i+1) access stacks must be updated.
  – Overhead due to transaction commits: O(d²).
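The quadratic bound can be checked with a quick calculation (a sketch; indexing as on the slides, where committing Yi updates d−i+1 access stacks):

```python
# Total eager stack updates over the commits of Yd down to Y0
# is 1 + 2 + ... + (d+1), which is Theta(d^2) versus O(d) work
# for the same program with transactions removed.

d = 100
total = sum(d - i + 1 for i in range(d + 1))
assert total == (d + 1) * (d + 2) // 2   # quadratic in the nesting depth d
```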

Page 49: Nested Parallelism in Transactional Memory

Outline

• Introduction
• Semantics of TM
• Difficulty of Conflict Detection
• Access Stack
• Lazy Access Stack
• Intuition for Final Design Using Traces and Analysis
• Conclusions and Future Work

Page 50: Nested Parallelism in Transactional Memory

Lazy Access Stack

[Figure: computation tree with transactions Y0–Y9 and Z1–Z4, many of them now inactive. Lazy access stack for L (bottom to top): Y0, Y1, Y2, Y3, Y4, Y5, Y7, Y9; Top = X. Equivalent (non-lazy) access stack: Y0, Y3, Y6, Y8.]

Don’t update access stacks on commits. Every transaction Y in the stack implicitly represents its closest active transactional ancestor.


Page 51: Nested Parallelism in Transactional Memory

The Oracle

When a transaction Z tries to access location L and L has transaction Y (possibly inactive) on top of its access stack:

TheOracle(Y, Z) {
  X ← Y's closest active ancestor transaction
  if (X is an ancestor of Z)
    return "no conflict"
  else
    return "conflict"
}

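TheOracle can be made executable as follows (illustrative Txn class; the naive walk shown here is exactly the O(d) cost the next slides discuss, which CWSTM's XConflict structure avoids):

```python
# Executable sketch of TheOracle: walk up from Y to its closest
# active transactional ancestor X, then test whether X is an
# ancestor of Z.

class Txn:
    def __init__(self, name, parent=None, active=True):
        self.name, self.parent, self.active = name, parent, active

def closest_active_ancestor(y):
    """Walk up from y (inclusive) to the first active transaction."""
    while y is not None and not y.active:
        y = y.parent
    return y

def is_ancestor(a, z):
    while z is not None:
        if z is a:
            return True
        z = z.parent
    return False

def the_oracle(y, z):
    x = closest_active_ancestor(y)
    return "no conflict" if is_ancestor(x, z) else "conflict"

# Y2 (on top of L's lazy stack) has committed; its closest active
# ancestor is Y0.  Z1 sits under Y0; Z2 is an unrelated transaction.
Y0 = Txn("Y0")
Y1 = Txn("Y1", Y0, active=False)     # committed
Y2 = Txn("Y2", Y1, active=False)     # committed
Z1 = Txn("Z1", Y0)
Z2 = Txn("Z2")

assert the_oracle(Y2, Z1) == "no conflict"  # X = Y0 is an ancestor of Z1
assert the_oracle(Y2, Z2) == "conflict"     # X = Y0 is not Z2's ancestor
```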

Page 52: Nested Parallelism in Transactional Memory

Closest Active Ancestor

When a transaction Z tries to access location L and L has transaction Y (possibly inactive) on top of its access stack:

TheOracle(Y, Z) {
  X ← Y's closest active ancestor transaction
  if (X is an ancestor of Z)
    return "no conflict"
  else
    return "conflict"
}

Walk up the tree to find X.

[Figure: deep chain of transactions Y0 ⊃ Y1 ⊃ … ⊃ Yd; the lazy stack for L contains mostly inactive entries, so the walk to the closest active ancestor can be long.]

Page 53: Nested Parallelism in Transactional Memory

Closest Active Ancestor

When a transaction Z tries to access location L and L has transaction Y (possibly inactive) on top of its access stack:

TheOracle(Y, Z) {
  X ← Y's closest active ancestor transaction
  if (X is an ancestor of Z)
    return "no conflict"
  else
    return "conflict"
}

Walk up the tree to find X.

PROBLEM: Each memory access might take O(d) time (d is the nesting depth).


Page 54: Nested Parallelism in Transactional Memory

Closest Active Ancestor

When a transaction Z tries to access location L and L has transaction Y (possibly inactive) on top of its access stack:

TheOracle(Y, Z) {
  X ← Y's closest active ancestor transaction
  if (X is an ancestor of Z)
    return "no conflict"
  else
    return "conflict"
}

CWSTM uses an XConflict data structure which supports the above query in O(1) time, because it does not always need to find X to answer the query.


Page 55: Nested Parallelism in Transactional Memory

Outline

• Introduction
• Computation Tree
• Definition of Conflicts and Design Attempt 1
• Access Stack
• Lazy Access Stack
• Intuition for Final Design Using Traces and Analysis
• Conclusions and Future Work

Page 56: Nested Parallelism in Transactional Memory

Traces

[Figure: computation tree partitioned into traces (X0, X1, X2, Y1, Y2, Z1, Z2) across 3 workers.]

To support XConflict queries efficiently, we group sections of the computation tree into traces.*

• Every trace executes serially on one processor; no synchronization overhead within a trace.

• Traces are created and modified only on steals.

In CWSTM: # traces = O(# steals).
Work-Stealing Theorem: # steals is small.
⇒ Overhead of maintaining traces is small.

Page 57: Nested Parallelism in Transactional Memory

The XConflict Query with Traces

When a transaction Z tries to access location L and L has transaction Y (possibly inactive) on top of its access stack:

Oracle query:

TheOracle(Y, Z) {
  X ← Y's closest active ancestor transaction
  if (X is an ancestor of Z)
    return "no conflict"
  else
    return "conflict"
}

Actual CWSTM query:

XConflict(Y, Z) {
  UX ← trace containing X
  // (X is Y's closest active ancestor transaction)
  if (UX is an ancestor of the trace containing Z)
    return "no conflict"
  else
    return "conflict"
}

Page 58: Nested Parallelism in Transactional Memory

Sources of Overhead in CWSTM

• Building the computation tree.
• Queries to XConflict:
  – At most one for every memory access*.
• Updates to traces for XConflict:
  – Creating/splitting traces.
  – Maintaining data structures for ancestor queries on traces.
  – Merging complete traces together.

• No rollbacks or retries if we assume no conflicts.

THEOREM: For a computation with no transaction conflicts and no concurrent readers to a shared memory location, CWSTM executes the computation in O(T1/P + PT∞) time.

O(1)-factor increase on total work (T1). Increases critical path to O(PT∞).

*Assuming no concurrent reads to the same location and no aborts.

Page 59: Nested Parallelism in Transactional Memory

Future Work

• CWSTM is the first design which supports nested parallelism and nested transactions in TM and guarantees low overhead (asymptotically).

• In the future:
  – Implement CWSTM in the Cilk runtime system and evaluate its performance.
  – Is there a better design which handles concurrent readers more efficiently?
  – Nested parallelism in TM for other schedulers.