Adaptive Thread Scheduling Techniques for Improving Scalability of Software Transactional Memory

Adaptive Thread Scheduling Techniques for Improving Scalability of Software

Transactional Memory

Kinson Chan, King Tin Lam, Cho-Li Wang

Presenter: Kinson ChanDate: 16 February 2010

PDCN 2011, Innsbruck, Austria

DEPARTMENT OF COMPUTER SCIENCETHE UNIVERSITY OF HONG KONG

Outline

• Motivation –‣ hardware trend and software transactional memory

• Background –‣ performance scalability‣ ratio-based concurrency control and its myth

• Solution –‣ our rate-based heuristic, Probe.

• Evaluation –‣ performance comparison

2

MotivationWhat is the current computing hardware trend,

and why is software transactional memory relevant?

Hardware trend: multicores

• Multicore processors‣ a.k.a. chip multiprocessing‣ multiple cores on a processor die‣ cores share a common cache‣ faster data sharing among threads‣ more running threads per cabinet

• Chip Multithreading‣ e.g. hyperthreading, coolthreads‣ more than one threads per core‣ hide the data load latency

4

L1! L1! L1! L1!

L2! L2! L2! L2!

L3!

1! 2! 3! 4! 5! 6! 7! 8!

a typical modern processor




4

L1! L1! L1! L1!

L2! L2! L2! L2!

L3!

1! 2! 3! 4! 5! 6! 7! 8!

Multiple cores





4

L1! L1! L1! L1!

L2! L2! L2! L2!

L3!

1! 2! 3! 4! 5! 6! 7! 8!

Mutli-threadper core


Now and future multicores

5

Micro-architecture Clock rate Cores Threads per

coreThreads per

package Shared cache Memory arrangement

IBM Power 7 ~ 3 GHz 4 ~ 8 4 32 Max 4 MBshared L3 NUMA

Sun Niagara2 1.2 ~ 1.6 GHz 4 ~ 8 8 64 Max 4 MBshared L2 NUMA

Intel Westmere ~ 2 GHz 4 ~ 8 2 16 Max 12 ~ 24 MB

shared L3 NUMA

Intel Harpertown ~ 3 GHz 2 x 2 2 8 2 x 6 MB

shared L3 UMA

AMD Bulldozer ~ 2 GHz 2 x 6 ~ 2 x 8 1 16 Max 8 MB

shared L3 NUMA

AMD Magny-Cours ~ 3 GHz 8 modules 2 per module 16 Max 8 MB

shared L3 NUMA

Intel Terascale ~ 4 GHz 80? 1? 80? 80 x 2 KBdist. cache NUCA

Now and future multicores

5

Micro-architecture Clock rate Cores Threads per

coreThreads per

package Shared cache Memory arrangement

IBM Power 7 ~ 3 GHz 4 ~ 8 4 32 Max 4 MBshared L3 NUMA

Sun Niagara2 1.2 ~ 1.6 GHz 4 ~ 8 8 64 Max 4 MBshared L2 NUMA

Intel Westmere ~ 2 GHz 4 ~ 8 2 16 Max 12 ~ 24 MB

shared L3 NUMA

Intel Harpertown ~ 3 GHz 2 x 2 2 8 2 x 6 MB

shared L3 UMA

AMD Bulldozer ~ 2 GHz 2 x 6 ~ 2 x 8 1 16 Max 8 MB

shared L3 NUMA

AMD Magny-Cours ~ 3 GHz 8 modules 2 per module 16 Max 8 MB

shared L3 NUMA

Intel Terascale ~ 4 GHz 80? 1? 80? 80 x 2 KBdist. cache NUCA

How can we scale our program to have these many threads?

Multi-threading and synchronization

6


6

Coarse grain locking

Easy / Correct(few locks, predictable)

Difficult to scale(excessive mutual

exclusion)


6




exclusion)

Fine-grain locking

Error prone(deadlock, forget to

lock, ...)

Scales better(allows more parallelism)


6




exclusion)

Fine-grain locking

Error prone(deadlock, forget to

lock, ...)

Scales better(allows more parallelism)

Do we have anything in between?

Easy / Correct

Scales good

STM optimistic execution

7

Begin Begin

Proceed Proceed

Commit Commit

Retry

Commit

x=x+4

y=y-4

x=x+2

y=y-2

Begin

Proceed Begin

Proceed

Commit

Commit

x=x+4

y=y-4

w=w+5

z=w

Thread 1 Thread 2 Thread 3 Thread 1 Thread 2 Thread 3

x=x+2

y=y-2

Success

Success

Success

Success

conflict detection

conflict detection

begin; x=x+4; y=y-4; commit;



begin; w=w+5; z=w; commit;

begin;x = x + 4;y = y - 4;commit;


7

Begin Begin

Proceed Proceed

Commit Commit

Retry

Commit

x=x+4

y=y-4

x=x+2

y=y-2

Begin

Proceed Begin

Proceed

Commit

Commit

x=x+4

y=y-4

w=w+5

z=w


x=x+2

y=y-2

Success

Success

Success

Success

conflict detection

conflict detection






7

Begin Begin

Proceed Proceed

Commit Commit

Retry

Commit

x=x+4

y=y-4

x=x+2

y=y-2

Begin

Proceed Begin

Proceed

Commit

Commit

x=x+4

y=y-4

w=w+5

z=w


x=x+2

y=y-2

Success

Success

Success

Success

conflict detection

conflict detection





case 1: two transactions conflicts:rollback and retry one of them.


7

Begin Begin

Proceed Proceed

Commit Commit

Retry

Commit

x=x+4

y=y-4

x=x+2

y=y-2

Begin

Proceed Begin

Proceed

Commit

Commit

x=x+4

y=y-4

w=w+5

z=w


x=x+2

y=y-2

Success

Success

Success

Success

conflict detection

conflict detection





case 1: two transactions conflicts:rollback and retry one of them.

case 2: two transactions do not conflict: they execute together,

achieving better parallelism.

C. J. Rossbach, O. S. Hofmann and Emmett Witchel, Is transactional programming actually easier, In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 45–56, 2010.

STM is easy

• In the University of Texas at Austin, 237 students taking Operating System courses were instructed to program the same problem with coarse locks, fine-grained locks, monitors and transactions...

8

DevelopmentTime

Errors

CodeComplexity


STM is easy


8

DevelopmentTime

Errors

CodeComplexity

LongShort

TMCoarse Fine


STM is easy


8

DevelopmentTime

Errors

CodeComplexity

LongShort

TMCoarse Fine

Simple Complex

TMCoarse Fine


STM is easy


8

DevelopmentTime

Errors

CodeComplexity

LongShort

TMCoarse Fine

Simple Complex

TMCoarse Fine

Less More

TM Coarse Fine

STM is a research toy

9


9

STM

SXMOSTM

DSTM

ASTM

TL2

TinySTM

TLRW

SwissTM

NOrec

TML

RingTM

InvalTM

DeuceTM

D2STM


9

STM

SXMOSTM

DSTM

ASTM

TL2

TinySTM

TLRW

SwissTM

NOrec

TML

RingTM

InvalTM

DeuceTM

D2STM

not

Company Products ResearchSun Dynamic STM library DSTM, TL2, TLRW,

Rock processor, ...

Intel Intel C++ STM compiler McRT-STM, ...

IBM C/C++ for TM on AIX STM extension on X10, ...

Microsoft STM.NET STM on Haskell, ...

AMD ASF instruction set extension

BackgroundWhat affects the transactional memory performance?

How can we adjust concurrency for best performance?

Threads and performance

11


11

thread#

attempt#

more threads, more transactional

attempts


11

×thread#

attempt#


attempts

thread#

comm

it prob.

more threads, smaller portion of

transactions to commit


11

× ∝thread#

attempt#


attempts

thread#

comm

it prob.



thread#

performance

concave curve of performance


11

× ∝thread#

attempt#


attempts

thread#

comm

it prob.



thread#

performance

concave curve of performance

optimal

Ratio- vs rate-based concurrency controls

12


12

• Concurrency control in STM:‣ achieve optimal performance by scheduling means.


12


• Different concepts:


12


• Different concepts:‣ commit ratio-based heuristics


12



✴ ratio = commits / (commits + aborts)


12



✴ ratio = commits / (commits + aborts)✴ reduce concurrency when ratio gets too low✴ relax concurrency when ratio gets higher than a threshold


12




‣ commit rate-based heuristics


12




‣ commit rate-based heuristics✴ rate = commits / time


12




‣ commit rate-based heuristics✴ rate = commits / time

‣ queuing after winner transactions✴ kernel-level programming, conditional waiting...

Ratio-based solutions

• Ansari, et al.:‣ introduces total commit [ratio] (TCR)‣ increases / decreases threads by comparing TCR and set-point (70%)

• Yoo and Lee:‣ introduces per-thread contention intensity (CI)‣ (likelihood of a thread to encounter contentions)‣ stalls for acquiring mutex when CI goes above a value (70%)

• Dolev, et al.:‣ activates hotspot detection when CI goes above a value (40%)‣ a thread stalls for acquiring mutex when hotspot is detected

13

14

Myths of ratio-based heuristics

• We want an application finishes faster‣ i.e. more transactions committed per unit time‣ (assumption: constant number of transactions)

• High commit ratio ≠ high performance‣ 1 thread + 100% ratio vs 4 threads + 40% commit ratio‣ engine rotation ≠ vehicle velocity

• Watching commit ratio is an inexact science‣ happens to be close estimation, though

• Drawbacks‣ over-serialization when commit ratio is low‣ over-relaxation when the commit ratio is high

14

Myths of ratio-based heuristics

• We want an application finishes faster‣ i.e. more transactions committed per unit time‣ (assumption: constant number of transactions)

• High commit ratio ≠ high performance‣ 1 thread + 100% ratio vs 4 threads + 40% commit ratio‣ engine rotation ≠ vehicle velocity

• Watching commit ratio is an inexact science‣ happens to be close estimation, though

• Drawbacks‣ over-serialization when commit ratio is low‣ over-relaxation when the commit ratio is high

• At any instance, we can only pick a thread count‣ “what if” questions not allowed in run-time

• What is high and what is low?‣ commit ratio goes between 0% and 100%‣ commit rate depends on transaction lengths

• Changing patterns‣ transaction nature may change along execution:‣ getting longer / shorter‣ getting more / less contentions

• Pre-defined bounds not acceptable‣ the optimal spot changes across execution timeline.

15

Challenges with rate-based solution

16

Commit ratio vs Commit rateC

omm

it R

atio

/ %

(G

reen

Lin

e)!

Com

mit

Rat

e (R

ed L

ine)!

16


omm

it R

atio

/ %

(G

reen

Lin

e)!

Com

mit

Rat

e (R

ed L

ine)!

commit ratio

16


omm

it R

atio

/ %

(G

reen

Lin

e)!

Com

mit

Rat

e (R

ed L

ine)!

commit rate

16


omm

it R

atio

/ %

(G

reen

Lin

e)!

Com

mit

Rat

e (R

ed L

ine)!

More threads results in better

performance

16


omm

it R

atio

/ %

(G

reen

Lin

e)!

Com

mit

Rat

e (R

ed L

ine)!

Excessive threads kills performance

16


omm

it R

atio

/ %

(G

reen

Lin

e)!

Com

mit

Rat

e (R

ed L

ine)!

Changing application

natures

16


omm

it R

atio

/ %

(G

reen

Lin

e)!

Com

mit

Rat

e (R

ed L

ine)!

Excessive threads yields fluctuations

16


omm

it R

atio

/ %

(G

reen

Lin

e)!

Com

mit

Rat

e (R

ed L

ine)!

Dropping ratioIncreasing rateShorter time

16


omm

it R

atio

/ %

(G

reen

Lin

e)!

Com

mit

Rat

e (R

ed L

ine)!

Low commit ratio, but still

scalable...

16


omm

it R

atio

/ %

(G

reen

Lin

e)!

Com

mit

Rat

e (R

ed L

ine)!

SolutionNow we know ratio-based solutions are not right.

What shall we do for a rate-based alternative?

Counting variables

18

commits:number of commits in a time-slice

aborts:number of aborts in a time-slice

quota:maximum concurrency

entered:currently active transactions

peak:peak concurrency in a time-slice

Counting variables

18






commits per timeslice:commit rate

Counting variables

18






commit ratio

Counting variables

18






quota ≤ entered:no more new transactions

stall new comers with pthread_sched();

Counting variables

18






quota ≥ peak:unused quota

reduce quota for a tight limit...

Counting variables

18






System architecture

19

Threadcreation

Threadscheduling

Memorymanagement

Input / outputsystem

Concurrencycontrol unit

Conflictdetectionengine

Shared memory

Activitylogger

Txn

Activitylogger

Txn

Activitylogger

Txn

Activitylogger

Txn

User code

Transactionalthreads

Transactionalmemorysystem

Operatingsystem

System architecture

19

Threadcreation

Threadscheduling

Memorymanagement




Shared memory

Activitylogger

Txn

Activitylogger

Txn

Activitylogger

Txn

Activitylogger

Txn

User code

Transactions execute as normal,

with conflict detection



Operatingsystem

System architecture

19

Threadcreation

Threadscheduling

Memorymanagement




Shared memory

Activitylogger

Txn

Activitylogger

Txn

Activitylogger

Txn

Activitylogger

Txn

User code

Concurrency control unit is

added as a hook, monitoring the performance



Operatingsystem

System architecture

19

Threadcreation

Threadscheduling

Memorymanagement




Shared memory

Activitylogger

Txn

Activitylogger

Txn

Activitylogger

Txn

Activitylogger

Txn

User code

Scheduler is invoked to stall

some new transactions, if appropriate



Operatingsystem

System architecture

19

Threadcreation

Threadscheduling

Memorymanagement




Shared memory

Activitylogger

Txn

Activitylogger

Txn

Activitylogger

Txn

Activitylogger

Txn

User code



Operatingsystem

Art of uninformed climbing

20

comm

it rate

active threads

direction: left


20

comm

it rate

active threads

direction: left


20

comm

it rate

active threads

direction: left


20

comm

it rate

active threads

direction: left


20

comm

it rate

active threads

direction: left


20

comm

it rate

active threads

direction: left


20

comm

it rate

active threads

direction: left


20

comm

it rate

active threads

direction: left


20

comm

it rate

active threads

direction: left


20

comm

it rate

active threads

direction: left

decreasing rate! change direction!


20

comm

it rate

active threads

direction: right


20

comm

it rate

active threads

direction: right


20

comm

it rate

active threads

direction: right


20

comm

it rate

active threads

direction: right

decreasing rate! change direction!


20

comm

it rate

active threads

direction: left


20

comm

it rate

active threads

direction: left

resultant probingregion

Performance EvaluationHow does our rate-based solution

compares with other ratio-based ones?

Evaluation Platform

• Dell PowerEdge M610 Blade Server‣ 2x Intel “Nehalem” Xeon E5540 2.53 GHz (8 cores, 16 threads)‣ ECC DDR3-1066 36 GB main memory

• STAMP Benchmark‣ original from Stanford University – http://stamp.stanford.edu/‣ modified version for TinySTM: http://www.tinystm.org/

• TinySTM 0.9.5‣ open-source version – http://www.tinystm.org/

• Yoo’s and Shrink concurrency control‣ from EPFL Distributed Programming Laboratory –

http://lpd.epfl.ch/site/research/tmeval/

22

http://stamp.stanford.edu

http://stamp.stanford.edu

http://www.tinystm.org






Probing in effect vs other heuristics

23

throttle2 probe2 basic yoo shrink

kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)

Figure 3. Commit Ratio, Commit Rate and Number of Stalled Threads of Some TM Applications

We have found Probe more favourable than Throttle, as wellas two other concurrency control policies. We have also foundour solutions are sensitive to the cache sharing, and refined ourimplementation accordingly.

In future we may consider new adaptive scheduling policiesthat observe more STM parameters for smarter decisions, includingtransaction length, read-write ratio, etc.

We also noticed that modern word-based STM come with acommon logical clock, which is atomically incremented every timea transaction commits. We believe this is a potential cache bottle-neck and we foresee a new STM protocol in the future.

AcknowledgmentsThis research was supported by XXX grant.

References[1] M. Ansari, C. Kotselidis, K. Jarvis, M. Lujan, C. Kirkham, and I. Wat-

son. Adaptive concurrency control for transactional memory. In Pro-ceedings of the 1st workshop on Programmability Issues for Multi-Core Computers, 2008.

[2] M. Ansari, C. Kotselidis, K. Jarvis, M. Lujan, and I. Watson. Ad-vanced concurrency control for transactional memory using transac-tion commit rate. In Proceedings of the 32rd International Conferenceon Distributed Computing Systems, pages 522–529, 2008.

[3] D. Dice, O. Shalev, and N. Shavit. Transactional locking ii. In Pro-ceedings of the 20th International Symposium on Distributed Comput-ing, pages 194–208, 2006.

[4] S. Dolev, D. Hendler, and A. Suissa. Car-stm: Scheduling-based col-lision avoidance and resolution for software transactional memory. InProceedings of the 27th ACM Symposium on Principles of DistributedComputing, pages 125–134, 2008.

[5] A. Dragojevic, A. Guerraoui, R. amd Singh, and V. Singh. Prevent-ing versus curing: Avoiding conflicts in transactional memories. InProceedings of the 28th ACM Symposium on Principles of DistributedComputing, pages 7–16, 2009.

[6] A. Dragojevic, R. Guerraoui, and M. Kapalka. Stretching transactionalmemory. In Proceedings of the 13th ACM SIGPLAN Symposium onProgramming Language Design and Implementation, pages 155–165,2009.

[7] R. Ennals. Software transactional memory should not be obstruction-free. Intel Research Cambridge, 2005.

[8] P. Felber, C. Fetzer, and T. Riegel. Dynamic performance tuning ofword-based software transactional memory. In Proceedings of the 13thACM SIGPLAN Symposium on Principles and Practice of ParallelProgramming, pages 237–246, 2008.

[9] M. Herlihy and J. Moss. Transactional memory: Architectural supportfor lock-free data structures. In Proceedings of the 20th annual

International Symposium on Computer Architecture, pages 289–300,1993.

[10] M. Herlihy, V. Luchangco, and M. Moir. Obstruction free synchroniza-tion: Double ended queues as an example. In Proceedings of the 23rdInternational Conference on Distributed Computing Systems, pages522–529, 2003.

[11] LPD-EPFL. How good is a transactional memory implementation.http://lpd.epfl.ch/site/research/tmeval, accessed on 30 July 2010.

[12] W. Maldonado, P. Marlier, P. Felber, A. Suissa, D. Hendler, A. Fe-dorova, J. Lawall, and G. Muller. Scheduling support for transactionalmemory contention management. In Proceedings of the 15th ACMSIGPLAN Symposium on Principles and Practice of Parallel Program-ming, pages 79–90, 2010.

[13] V. Marathe, M. Spear, C. Heriot, A. Acharya, D. Eisenstat,W. Scherer III, and M. Scott. Toward high performance nonblockingsoftware transactional memory. In Proceedings of the 1st ACM SIG-PLAN Workshop on Languages, Compilers, and Hardware Support forTransactional Computing, 2006.

[14] C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun. Stamp: Standfordtransactional applciations for multi-processing. In Proceedings of the2008 IEEE International Symposium on Workload Characterization,pages 35–46, 2008.

[15] M. OGara. The intel roadmap: The shift to multicore is an inflectionpoint for software design philosophy. 2005. http://opensource.sys-con.com/node/48477, accessed on 28 July 2010.

[16] H. Ramadan, C. Rossbach, and E. Witchel. Dependance-aware trans-actional memory for increased concurrency. In Proceedings of the41st IEEE/ACM International Symposium on Microarchitecture, pages246–257, 2008.

[17] W. Scherer III and M. Scott. Contention management in dynamicsoftware transactional memory. In Proceedings of the 24th ACMSymposium on Principles of Distributed Computing, pages 240–248,2005.

[18] N. Shavit and D. Touitou. Software transactional memory. In Pro-ceedings of the 14th ACM Symposium on Principles of DistributedComputing, pages 204–213, 1995.

[19] T. Usui, Y. Smaragdakis, and R. Behrends. Adaptive locks: Combiningtransactions and locks for efficient concurrency. In Proceedings of the27th ACM Symposium on Principles of Distributed Computing, 2009.

[20] R. Yoo and H. Lee. Adaptive transaction scheduling for transactionalmemory systems. In Proceedings of the 20th ACM Symposium onParallelism in Algorithms and Architectures, pages 169–178, 2009.

9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

dolev


23


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

commit ratio

dolev


23


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

commit rate

dolev


23


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

threads stalled

dolev


23


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

low commit ratio

OriginalTinySTM

dolev


23


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

mild adjustment

mild adjustment

Probe: rate-basedconcurrency control

dolev


23


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

higher commit rate

higher commit rate

Probe: rate-basedconcurrency control

dolev


23


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

aim: high ratio

aim: high ratio

dolevYoo’s and Dolev’sratio-based concurrency control


23


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

large adjustment

large adjustment



23


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

even lower rate...

even lower rate...



23


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3


kmeans-216 threads

yada8 threads

100

75

50

0

25

100

75

50

0

25

680K

510K

340K

0

170K

1.6M

1.2M

800K

0

400K

4

3

2

0

1

16

12

8

0

4

Com

mit

R�t��

(Gre

en D

otte

d Li

ne) /

%

Com

mit

Rat

e (R

ed S

olid

Lin

e) /

Tran

sact

ions

per

Sec

ond

Stal

led

Thre

ads

(Blu

e D

ashe

d Li

ne)




























9 2010/8/3

dolev

Yoo’s Dolev’s Throttle Probe Probe2

yada

vacation-1

kmeans-2

vacation-2

kmeans-1

intruder

genome

ssca

labyrinth

average

+70.5% +61.02% +81.33% +101.79% +81.59%

+38.14% +18.39% +35.68% +54.90% +66.26%

-6.50% -19.56% -11.38% +2.52% +31.59%

+3.19% -19.04% +6.63% +2.42% +8.81%

-1.45% -28.96% -24.49% -1.38% +8.19%

-12.95% -28.03% -35.13% -14.65% +5.26%

+0.46% -11.80% -4.73% -7.55% -1.01%

+0.58% +0.29% -25.08% -26.51% -4.62%

-2.37% -1.34% -27.66% -12.22% -6.73%

+9.96% -3.23% -0.54% +11.03% +21.04%

Performance comparison

24


yada

vacation-1

kmeans-2

vacation-2

kmeans-1

intruder

genome

ssca

labyrinth

average

+70.5% +61.02% +81.33% +101.79% +81.59%

+38.14% +18.39% +35.68% +54.90% +66.26%

-6.50% -19.56% -11.38% +2.52% +31.59%

+3.19% -19.04% +6.63% +2.42% +8.81%

-1.45% -28.96% -24.49% -1.38% +8.19%

-12.95% -28.03% -35.13% -14.65% +5.26%

+0.46% -11.80% -4.73% -7.55% -1.01%

+0.58% +0.29% -25.08% -26.51% -4.62%

-2.37% -1.34% -27.66% -12.22% -6.73%

+9.96% -3.23% -0.54% +11.03% +21.04%


24

Ratio-based Heuristics


yada

vacation-1

kmeans-2

vacation-2

kmeans-1

intruder

genome

ssca

labyrinth

average

+70.5% +61.02% +81.33% +101.79% +81.59%

+38.14% +18.39% +35.68% +54.90% +66.26%

-6.50% -19.56% -11.38% +2.52% +31.59%

+3.19% -19.04% +6.63% +2.42% +8.81%

-1.45% -28.96% -24.49% -1.38% +8.19%

-12.95% -28.03% -35.13% -14.65% +5.26%

+0.46% -11.80% -4.73% -7.55% -1.01%

+0.58% +0.29% -25.08% -26.51% -4.62%

-2.37% -1.34% -27.66% -12.22% -6.73%

+9.96% -3.23% -0.54% +11.03% +21.04%


24

Ratio-based Heuristics

Naive: scales for 25% ~ 75%

of ratio


yada

vacation-1

kmeans-2

vacation-2

kmeans-1

intruder

genome

ssca

labyrinth

average

+70.5% +61.02% +81.33% +101.79% +81.59%

+38.14% +18.39% +35.68% +54.90% +66.26%

-6.50% -19.56% -11.38% +2.52% +31.59%

+3.19% -19.04% +6.63% +2.42% +8.81%

-1.45% -28.96% -24.49% -1.38% +8.19%

-12.95% -28.03% -35.13% -14.65% +5.26%

+0.46% -11.80% -4.73% -7.55% -1.01%

+0.58% +0.29% -25.08% -26.51% -4.62%

-2.37% -1.34% -27.66% -12.22% -6.73%

+9.96% -3.23% -0.54% +11.03% +21.04%


24

Rate-based Heuristics


yada

vacation-1

kmeans-2

vacation-2

kmeans-1

intruder

genome

ssca

labyrinth

average

+70.5% +61.02% +81.33% +101.79% +81.59%

+38.14% +18.39% +35.68% +54.90% +66.26%

-6.50% -19.56% -11.38% +2.52% +31.59%

+3.19% -19.04% +6.63% +2.42% +8.81%

-1.45% -28.96% -24.49% -1.38% +8.19%

-12.95% -28.03% -35.13% -14.65% +5.26%

+0.46% -11.80% -4.73% -7.55% -1.01%

+0.58% +0.29% -25.08% -26.51% -4.62%

-2.37% -1.34% -27.66% -12.22% -6.73%

+9.96% -3.23% -0.54% +11.03% +21.04%


24

Rate-based Heuristics

Better number counting

strategy


yada

vacation-1

kmeans-2

vacation-2

kmeans-1

intruder

genome

ssca

labyrinth

average

+70.5% +61.02% +81.33% +101.79% +81.59%

+38.14% +18.39% +35.68% +54.90% +66.26%

-6.50% -19.56% -11.38% +2.52% +31.59%

+3.19% -19.04% +6.63% +2.42% +8.81%

-1.45% -28.96% -24.49% -1.38% +8.19%

-12.95% -28.03% -35.13% -14.65% +5.26%

+0.46% -11.80% -4.73% -7.55% -1.01%

+0.58% +0.29% -25.08% -26.51% -4.62%

-2.37% -1.34% -27.66% -12.22% -6.73%

+9.96% -3.23% -0.54% +11.03% +21.04%


24


yada

vacation-1

kmeans-2

vacation-2

kmeans-1

intruder

genome

ssca

labyrinth

average

+70.5% +61.02% +81.33% +101.79% +81.59%

+38.14% +18.39% +35.68% +54.90% +66.26%

-6.50% -19.56% -11.38% +2.52% +31.59%

+3.19% -19.04% +6.63% +2.42% +8.81%

-1.45% -28.96% -24.49% -1.38% +8.19%

-12.95% -28.03% -35.13% -14.65% +5.26%

+0.46% -11.80% -4.73% -7.55% -1.01%

+0.58% +0.29% -25.08% -26.51% -4.62%

-2.37% -1.34% -27.66% -12.22% -6.73%

+9.96% -3.23% -0.54% +11.03% +21.04%


24

over-relaxation


yada

vacation-1

kmeans-2

vacation-2

kmeans-1

intruder

genome

ssca

labyrinth

average

+70.5% +61.02% +81.33% +101.79% +81.59%

+38.14% +18.39% +35.68% +54.90% +66.26%

-6.50% -19.56% -11.38% +2.52% +31.59%

+3.19% -19.04% +6.63% +2.42% +8.81%

-1.45% -28.96% -24.49% -1.38% +8.19%

-12.95% -28.03% -35.13% -14.65% +5.26%

+0.46% -11.80% -4.73% -7.55% -1.01%

+0.58% +0.29% -25.08% -26.51% -4.62%

-2.37% -1.34% -27.66% -12.22% -6.73%

+9.96% -3.23% -0.54% +11.03% +21.04%


24

over-reaction


yada

vacation-1

kmeans-2

vacation-2

kmeans-1

intruder

genome

ssca

labyrinth

average

+70.5% +61.02% +81.33% +101.79% +81.59%

+38.14% +18.39% +35.68% +54.90% +66.26%

-6.50% -19.56% -11.38% +2.52% +31.59%

+3.19% -19.04% +6.63% +2.42% +8.81%

-1.45% -28.96% -24.49% -1.38% +8.19%

-12.95% -28.03% -35.13% -14.65% +5.26%

+0.46% -11.80% -4.73% -7.55% -1.01%

+0.58% +0.29% -25.08% -26.51% -4.62%

-2.37% -1.34% -27.66% -12.22% -6.73%

+9.96% -3.23% -0.54% +11.03% +21.04%


24

Conclusions

• Trend of multicore urges us to write parallel computer programs

• Software transactional memory is part of future computation‣ easier to program, less errors, neat code‣ but it needs concurrency control for the best performance

• Ratio-based vs rate-based concurrency heuristics‣ ratio-based heuristics are inexact approximations‣ watching ratio only causes over-reaction / over-relaxation

• Rate-based concurrency heuristics, Probe, outperforms

25

Thank You!Contact me:[email protected]

HKU CS Systems Research Group:http://www.srg.cs.hku.hk/

Dr. Cho-Li Wang’s Webpage:http://www.cs.hku.hk/~clwang/

mailto:[email protected]

mailto:[email protected]

http://www.srg.cs.hku.hk

http://www.srg.cs.hku.hk

http://www.cs.hku.hk/~clwang/

http://www.cs.hku.hk/~clwang/