Two Ways of Speeding Up Transactional Memory Algorithms

Vincent Gramoli

Joint work with Pascal Felber, Rachid Guerraoui, Derin Harmanci

Roadmap

1. Motivations

2. Transactional Memory

3. Problems of Efficiency

4. Input Acceptance

5. Elastic Transactions

6. Conclusion

Single CPU Limitations

• Transistor size still decreases [Moore’s law]

• Induced overheating disturbs computation

• Clock speed has stopped doubling since 2004

[“The free lunch is over” by Herb Sutter]

Manufactured Multicores

[Timeline figure; labels as they appear:]
• Intel COO announces Multicore revolution
• AMD announces the 2-core Opteron
• AMD announces the 4-core Opteron
• Intel announces 4-core Xeon 5000 series
• Intel announces 8-core Nehalem EX
• SUN Niagara 2 w/ 8 cores & 64 HW threads
• Intel announces 6-core Xeon 7000 series
• SUN announces the 8-core Niagara

Concurrent Programming

• Difficult task:
  – Using locks, how to avoid deadlock?
    Thread1 {lock(x); lock(y);} // Thread2 {lock(y); lock(x);}
  – Using lock-free (LF) primitives, how can composition preserve atomicity?
    LF-move(x,y) ≠ LF-delete(x) + LF-insert(y)

• Dedicated to expert programmers:
  – Database programmers
  – Scientific computing programmers
  – What about other programmers?

• Democratizing multicores requires new programming abstractions





Transactional Memory

An abstraction: a black box that encapsulates all synchronization
– all read/write accesses to shared data are protected transparently

BEGIN_TX R(act) W(act,v) END_TX

Assume we want to read (R) and write (W) a shared bank account ‘act’ atomically. We simply have to label the region of the sequential code using the transaction delimiters BEGIN_TX and END_TX.



[Animated figure: dialogue between the transaction BEGIN_TX R(act) W(act,v’) END_TX and the TM]

• BEGIN_TX: after this point, operations will be handled by the TM
• R(act): read through the TM?
  – TM: Sounds good, I keep track of your read; you can return v1
• W(act,v’): write through the TM?
  – TM: Sounds good, I keep track of your write; write has been scheduled
• When the TM refuses an access:
  – TM: No way, there is a risk of safety violation; abort, roll-back, and restart the whole transaction later on
• END_TX: after this point, all operations become unprotected again




– atomicity is preserved under transaction composition

move(acc1, acc2, amt) { BEGIN_TX delete(acc1, amt) insert(acc2, amt) END_TX }

delete(acc, amt) { BEGIN_TX v = R(acc) W(acc, v-amt) END_TX }

insert(acc, amt) { BEGIN_TX v = R(acc) W(acc, v+amt) END_TX }

delete + insert = move
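The dialogue with the TM and the composition of delete and insert into move can be sketched in Python. This is only an illustration under stated assumptions, not the design of any STM named in these slides: the class `ToyTM`, its methods, the helper `balance`, and the version-based commit-time check are all hypothetical choices made for the example.

```python
import threading


class Abort(Exception):
    """Signals that the current transaction must roll back and restart."""


class ToyTM:
    """Hypothetical toy TM: buffered writes, versioned reads, commit-time check."""

    def __init__(self):
        self._commit_lock = threading.Lock()
        self._mem = {}                    # location -> (value, version)
        self._ctx = threading.local()     # per-thread transaction state

    def atomic(self, body):
        """Run body() between BEGIN_TX and END_TX, restarting it on abort."""
        if getattr(self._ctx, "tx", None) is not None:
            return body()                 # nested call joins the enclosing transaction
        while True:
            self._ctx.tx = {"reads": {}, "writes": {}}   # BEGIN_TX
            try:
                result = body()
                self._commit()                            # END_TX
                return result
            except Abort:
                continue                  # roll back (drop buffers) and restart
            finally:
                self._ctx.tx = None       # operations are unprotected again

    def read(self, loc):
        tx = self._ctx.tx
        if loc in tx["writes"]:
            return tx["writes"][loc]      # read our own scheduled write
        value, version = self._mem.get(loc, (0, 0))
        if tx["reads"].setdefault(loc, version) != version:
            raise Abort()                 # location changed since our first read
        return value

    def write(self, loc, value):
        self._ctx.tx["writes"][loc] = value   # the write is only scheduled here

    def _commit(self):
        tx = self._ctx.tx
        with self._commit_lock:
            for loc, seen in tx["reads"].items():
                if self._mem.get(loc, (0, 0))[1] != seen:
                    raise Abort()         # risk of safety violation
            for loc, value in tx["writes"].items():
                version = self._mem.get(loc, (0, 0))[1]
                self._mem[loc] = (value, version + 1)


tm = ToyTM()


def delete(acc, amt):
    tm.atomic(lambda: tm.write(acc, tm.read(acc) - amt))


def insert(acc, amt):
    tm.atomic(lambda: tm.write(acc, tm.read(acc) + amt))


def move(acc1, acc2, amt):
    # Composition: both nested transactions commit or abort together.
    tm.atomic(lambda: (delete(acc1, amt), insert(acc2, amt)))


def balance(acc):
    return tm.atomic(lambda: tm.read(acc))
```

In this sketch a nested `atomic` call simply joins the enclosing transaction, which is one simple way to obtain the composition property; unknown accounts are assumed to start at 0.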





1st Problem: Wasted Effort Problem

Transactions waste effort when they abort and roll back

[Figure: two concurrent transactions, BEGIN_TX W(x) END_TX and BEGIN_TX R(x) END_TX, with their events numbered (1) to (4)]

Although both transactions can commit safely, one of them is aborted by common STMs:

TL2, WSTM, DSTM, TinySTM

Some aborts are unnecessary

2nd Problem: Lack of Concurrency

Transactions ensure stronger guarantees than necessary

Example: sorted linked list implementation of integer set

[Figure: sorted linked list with nodes h, y, z, t; insert(x) and search(z) run concurrently, x being the inserted node]

search(z): BEGIN_TX R(h) R(y) R(z) END_TX

insert(x): BEGIN_TX … W(h) END_TX


Both transactions could commit without violating linked-list linearizability, but transactional models enforce read/write atomicity.


A Metric for Input Acceptance

• TM efficiency depends on
  – Execution speed
  – Number of successful (committed) transactions

• The input acceptance is the ability of a TM to commit transactions

• The commit-abort ratio is “σ”: # committed tx / # complete tx (committed or aborted)

How do STMs perform w.r.t. this metric?

• Ideal goal: no abort (σ = 1)

• A TM accepts an input if σ = 1

• What is accepted by the existing STMs?
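As a worked instance of the metric, the following minimal sketch computes σ from counts of committed and aborted transactions and checks the acceptance condition σ = 1. The function names are hypothetical, and treating "complete" transactions as committed plus aborted ones is the assumption used here.

```python
def commit_abort_ratio(committed: int, aborted: int) -> float:
    """Return sigma = committed / (committed + aborted)."""
    completed = committed + aborted
    if completed == 0:
        raise ValueError("no complete transaction: sigma is undefined")
    return committed / completed


def accepts(committed: int, aborted: int) -> bool:
    """A TM accepts an input when no transaction aborts (sigma = 1)."""
    return commit_abort_ratio(committed, aborted) == 1.0
```

For example, 3 commits and 1 abort give σ = 3/4, so that input is not accepted.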

Identifying TM designs

Designs   Meaning                            TM examples
VWVR      Visible write, visible read        SXM
VWIR      Visible write, invisible read      DSTM, TinySTM
IWIR      Invisible write, invisible read    WSTM, TL2
CTR       Commit-time relaxation             TSTM
RTR       Real-time relaxation               SSTM

Formalizing Workload as an Input

Events (i.e., an alphabet):
$s_i$: start event of transaction i
$w^x_i$: write request of transaction i on location x
$r^x_i$: read request of transaction i on location x
$\pi^{(x)}_i$: any event of transaction i (on location x)
$c_i$: commit request of transaction i

An input pattern is a totally ordered set of events (i.e., a word)
An input class is a set of input patterns (i.e., a language):

| represents the choice (e.g., “a | b” means “a” or “b”)
* represents the Kleene closure (e.g., “a*” means “ε|a|aa|…”)
¬ represents the complement (e.g., “¬a” means “any event but a”)

Input Acceptance Upper-bound of VWIR

Theorem. There is no VWIR design that accepts the following input class:

$C_2 = \pi^* \, (r^x_i \, \neg c_i^* \, w^x_j \, \neg c_i^* \, c_j \mid w^x_j \, \neg c_j^* \, r^x_i) \, \pi^*$.

Example: a transaction executing BEGIN_TX W(x) END_TX runs concurrently with a transaction executing BEGIN_TX R(x) END_TX.
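To make the input class concrete, here is a small membership test. It is a sketch that assumes C2 is read as π* (r^x_i (¬c_i)* w^x_j (¬c_i)* c_j | w^x_j (¬c_j)* r^x_i) π*, with i ≠ j; the `Event` encoding and the function name `in_c2` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    kind: str                  # "s" start, "r" read, "w" write, "c" commit request
    tx: int                    # transaction identifier
    loc: Optional[str] = None  # location, for reads and writes only


def in_c2(word: List[Event]) -> bool:
    """True if the word matches C2 for some transactions i != j and location x."""
    for a, first in enumerate(word):
        for b in range(a + 1, len(word)):
            second = word[b]
            if first.loc is None or first.loc != second.loc:
                continue
            if first.tx == second.tx:
                continue
            between = word[a + 1:b]
            # Second alternative: w^x_j, then no c_j, then r^x_i.
            if first.kind == "w" and second.kind == "r":
                if not any(e.kind == "c" and e.tx == first.tx for e in between):
                    return True
            # First alternative: r^x_i, no c_i, w^x_j, no c_i, then c_j.
            if first.kind == "r" and second.kind == "w":
                i, j = first.tx, second.tx
                if any(e.kind == "c" and e.tx == i for e in between):
                    continue
                for e in word[b + 1:]:
                    if e.kind == "c" and e.tx == i:
                        break          # c_i occurs before c_j: no match
                    if e.kind == "c" and e.tx == j:
                        return True
    return False
```

A word in which transaction 2 reads x after transaction 1 wrote x and before transaction 1 requests to commit is in C2; the same two transactions executed one after the other are not.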

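Going further

Other classes:

$C_1 = \pi^* \, (\pi^x_i \, \neg c_i^* \, w^x_j \mid w^x_j \, \neg c_j^* \, \pi^x_i) \, \pi^*$

$C_3 = \pi^* \, (r^x_i \, \neg c_i^* \, w^x_j \mid w^x_j \, \neg c_j^* \, r^x_i) \, \neg c_i^* \, c_j \, \pi^*$

$C_4 = (\neg w^x)^* \, r^x_i \, \neg c_i^* \, w^x_j \, \neg c_i^* \, c_j \, \neg c_i^* \, s_k \, \neg(c_i \mid c_k \mid r^x_k)^* \, w^y_k \, \neg(c_i \mid c_k \mid r^x_k)^* \, c_k \, \neg c_i^* \, r^y_i \, \pi^*$

Other impossibility results:
Theorem 1. VWVR design does not accept input class C1.
Theorem 3. IWIR design does not accept input class C3.
Theorem 4. CTR design does not accept input class C4.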

Input Acceptance Classification

[Figure built up step by step: nested regions, each design shown with the complement of the input class it does not accept]
• VWVR (e.g., SXM): ~C1
• VWIR (e.g., DSTM, TinySTM): ~C2
• IWIR (e.g., WSTM, TL2): ~C3
• CTR (e.g., TSTM): ~C4
• RTR (e.g., SSTM): ~C5

Serializable STM needs to track all conflicts

C5 = Ø

Experimental Validation: Scalability

20% update operations: 10% linked-list insert, 10% linked-list delete
80% other operations: linked-list contains
Dual quad-core Intel Xeon


Software Transactional Memories

• TinySTM, LSA-STM, SSTM, SwissTM: efficient?

[Figure: sorted linked list with nodes h, y, z, t; search(z) and insert(x) run concurrently]

The two transactions cannot both commit, because read/write atomicity is violated, even though linked-list linearizability is guaranteed.

Elastic Transactional Memory (ε-STM)

• Elastic transactions: weaker than normal ones

The goal is to cut transactions into sub-parts

A transaction is cut in 2 parts, with respective operations π(x,*) and π(y,*), if:
- there are no two writes on x and y in between;
- all writes are in the same part;
- the first operation of any part is a read.

search(z): BEGIN_EL_TX R(h) R(y) R(z) END_TX

insert(x): BEGIN_EL_TX … W(h) END_TX

[Figure label: Cut]



The key idea is that, when reading element e, one of the following holds:
• the predecessor has not changed since it was read,
• or e has not changed since the predecessor was read.

This ensures that the parsing of the list is always consistent, although atomicity is relaxed.
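The two conditions can be checked with per-node versions. The sketch below is a single-threaded simulation, not the ε-STM implementation: the global version clock, the `Node` layout, and every name in it are assumptions made for illustration. The reader remembers only the last node it read, and aborts only when both the predecessor and the current element changed.

```python
class ElasticAbort(Exception):
    """Raised when neither of the two consistency conditions holds."""


class Clock:
    """Hypothetical global version clock, advanced by every write."""

    def __init__(self):
        self.now = 0


class Node:
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = nxt
        self.version = 0      # clock value at the last write to this node


def write_next(node, new_next, clock):
    """Update node.next and stamp the node with a fresh clock value."""
    clock.now += 1
    node.next = new_next
    node.version = clock.now


class ElasticReader:
    """Tracks only the previously read node instead of a full read set."""

    def __init__(self, clock):
        self.clock = clock
        self.pred = None          # predecessor: the last node read
        self.pred_version = None  # its version when it was read
        self.pred_read_at = None  # clock value when it was read

    def read(self, node):
        if self.pred is not None:
            pred_unchanged = self.pred.version == self.pred_version
            node_unchanged = node.version <= self.pred_read_at
            if not (pred_unchanged or node_unchanged):
                raise ElasticAbort(node.key)
        self.pred = node
        self.pred_version = node.version
        self.pred_read_at = self.clock.now
        return node.next


def search(head, key, clock, between_reads=None):
    """Elastic search; between_reads(node) simulates a concurrent update."""
    reader = ElasticReader(clock)
    node = head
    while node is not None:
        nxt = reader.read(node)
        if node.key == key:
            return True
        if between_reads is not None:
            between_reads(node)
        node = nxt
    return False
```

With the list h, y, z, t, an insert(x) that rewrites h right after the search has read h does not abort the search in this sketch, because y is unchanged since h was read; a transaction that validated its whole read set would see that h changed.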

• Elastic transactions:
  – Weaker than normal ones (cannot implement sum)
  – Compatible with normal ones (retain simplicity)

• Apply to various search structures:
  – Red-black tree, skip list, hash table…

• Could be applied to counter increment transactions as well …and others?

μBenchmarks (5% insert, 5% delete, 90% search)

(HT w/ 256 buckets)

μBenchmarks (Cont’d.)

(10% move, 10% sum, 80% search)

(5% insert, 5% delete, 90% search)

Conclusion

• Transactional Memory is promisingly simple
• But its efficiency can be improved:
  – By increasing input acceptance;
  – By weakening the transactional model.

• Input Acceptance:
  – Maximal input acceptance is not practical
  – The best tradeoff (input acceptance vs. practicality) is an open question.

• Elastic transactions:
  – Allow more concurrency than locking techniques.
  – We should characterize all their applications.

Related Work

• Permissiveness [Guerraoui et al. DISC 2008]:
  – Indicates the variety of output/history
  – Does not depend on the input

• Open Nesting [E. Moss, WMPI 2006]:
  – Each sub-transaction commits independently from its parent transaction(s)
  – Complex roll-back mechanism [Ni et al. PPoPP’07]

• Early Release [Herlihy et al. PODC 2003]:
  – Some reads may be forgotten (removed from r-set)
  – The programmer has to decide which objects can be released and when; this cannot be automatic [Harris et al. TRANSACT 2007]

• Transactional Boosting [Herlihy et al., PPoPP 2007]:
  – Transforms linearizable objects into transactional objects
  – Requires defining commutative and inverse operations

Thank you

• On the Input Acceptance of Transactional Memory, Parallel Processing Letters, Dec. 2009

• Elastic Transactions, EPFL Technical Report - LPD-REPORT-2009-002
