Optimistic Intra-Transaction Parallelism using Thread Level Speculation
Chris Colohan1, Anastassia Ailamaki1,J. Gregory Steffan2 and Todd C. Mowry1,3
1Carnegie Mellon University2University of Toronto
3Intel Research Pittsburgh
2
Chip Multiprocessors are Here!
2 cores now; soon 4, 8, 16, or 32
Multiple threads per core
How do we best use them?
IBM Power 5
AMD Opteron
Intel Yonah
3
Multi-Core Enhances Throughput
Database ServerUsers
Transactions DBMS Database
Cores can run concurrent transactions and improve
throughput
4
Using Multiple CoresDatabase ServerUsers
Transactions DBMS Database
Can multiple cores improve transaction latency?
5
Parallelizing transactions
SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}
DBMS
Intra-query parallelism
Used for long-running queries (decision support)
Does not work for short queries
Short queries dominate in commercial workloads
6
Parallelizing transactions
SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}
DBMS
Intra-transaction parallelism: each thread spans multiple queries
Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution…
7
Parallelizing transactions
SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}
DBMS
Intra-transaction parallelism: breaks transaction into threads
Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution…
Thread Level Speculation (TLS) makes parallelization easier.
8
Thread Level Speculation (TLS)
*p=
*q=
=*p
=*q
Sequential
Time
Parallel
*p=
*q=
=*p
=*q
=*p
=*q
Epoch 1 Epoch 2
9
Thread Level Speculation (TLS)
*p=
*q=
=*p
=*q
Sequential
Time
*p=
*q=
=*p
R2
Violation!
=*p
=*q
Parallel
Use epochs
Detect violations
Restart to recover
Buffer state
Oldest epoch: never restarts, no buffering
Worst case: sequential
Best case: fully parallel
Data dependences limit performance.
Epoch 1 Epoch 2
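The epoch mechanism above can be sketched as a toy software model (illustrative only, not the paper's hardware): each epoch records the addresses it speculatively loaded, and a store by an earlier epoch to one of those addresses violates the later epoch, forcing it to restart.

```c
#include <stddef.h>

#define MAX_LOADS 16

typedef struct {
    const void *loads[MAX_LOADS]; /* addresses speculatively loaded */
    int n_loads;
    int violated;
} epoch_t;

static void epoch_load(epoch_t *e, const void *addr) {
    if (e->n_loads < MAX_LOADS)
        e->loads[e->n_loads++] = addr;
}

/* Store by an epoch older than `later`: check the later epoch's load set. */
static void earlier_store(epoch_t *later, const void *addr) {
    for (int i = 0; i < later->n_loads; i++)
        if (later->loads[i] == addr)
            later->violated = 1; /* read-after-write dependence: restart */
}
```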
10
TransactionProgrammer
DBMS Programmer
Hardware Developer
A Coordinated Effort
Choose epoch boundaries
Remove performance bottlenecks
Add TLS support to architecture
11
So what’s new?
Intra-transaction parallelism:
Without changing the transactions
With minor changes to the DBMS
Without having to worry about locking
Without introducing concurrency bugs
With good performance
Halve transaction latency on four cores
12
Related Work
Optimistic Concurrency Control (Kung82)
Sagas (Molina&Salem87)
Transaction chopping (Shasha95)
13
Outline
Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions
14
Case Study: New Order (TPC-C)
Only dependence is on the quantity field
Very unlikely to occur (1/100,000)
GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}
78% of transaction execution time
15
Case Study: New Order (TPC-C)
GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}
GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
TLS_foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}
16
Outline
Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions
17
Dependences in DBMS
Time
18
Dependences in DBMS
Time
Dependences serialize execution!
Example: statistics gathering (pages_pinned++)
TLS maintains serial ordering of increments
To remove, use per-CPU counters
Performance tuning: profile execution, remove bottleneck dependence, repeat
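The per-CPU counter fix can be sketched as follows (names illustrative): each CPU increments its own slot, so speculative epochs on different CPUs never write the same location and TLS sees no dependence; the total is summed only when actually needed.

```c
#define NCPUS 4

static long pages_pinned[NCPUS]; /* one counter per CPU; pad to cache lines in practice */

static void inc_pages_pinned(int cpu) {
    pages_pinned[cpu]++; /* no cross-epoch dependence */
}

static long total_pages_pinned(void) {
    long sum = 0;
    for (int i = 0; i < NCPUS; i++)
        sum += pages_pinned[i];
    return sum;
}
```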
19
Buffer Pool Management
CPU
Buffer Pool
get_page(5)
ref: 1
put_page(5)
ref: 0
20
get_page(5)put_page(5)
Buffer Pool Management
CPU
Buffer Pool
get_page(5)
ref: 0
put_page(5)
Time
get_page(5)
put_page(5)
TLS ensures first epoch gets the page first. Who cares?
TLS maintains original load/store order
Sometimes this is not needed
get_page(5)put_page(5)
21
Buffer Pool Management
CPU
Buffer Pool
get_page(5)
ref: 0
put_page(5)
Time
get_page(5)
put_page(5)
= Escape Speculation
Isolated: undoing get_page will not affect other transactions
Undoable: have an operation (put_page) which returns the system to its initial state
• Escape speculation
• Invoke operation
• Store undo function
• Resume speculation
get_page(5)put_page(5)
put_page(5)get_page(5)
22
Buffer Pool Management
CPU
Buffer Pool
get_page(5)
ref: 0
put_page(5)
Time
get_page(5)
put_page(5)
get_page(5)put_page(5)
Not undoable!
get_page(5)put_page(5)
= Escape Speculation
23
Buffer Pool Management
CPU
Buffer Pool
get_page(5)
ref: 0
put_page(5)
Time
get_page(5)
put_page(5)
get_page(5)
put_page(5)
Delay put_page until end of epoch
Avoid dependence
= Escape Speculation
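The "delay until end of epoch" idea can be sketched like this (names and structure illustrative, not BerkeleyDB's): instead of calling put_page() inside the epoch, which would create a dependence on the reference count, the release is queued and replayed once the epoch commits.

```c
#define MAX_DEFERRED 32

typedef struct {
    int page_ids[MAX_DEFERRED];
    int n;
} release_queue_t;

static void enqueue_put_page(release_queue_t *q, int page_id) {
    if (q->n < MAX_DEFERRED)
        q->page_ids[q->n++] = page_id; /* record it; don't touch the refcount yet */
}

/* Called when the epoch is oldest ("homefree"): perform the real releases. */
static int commit_releases(release_queue_t *q, int *refcount) {
    int done = 0;
    for (int i = 0; i < q->n; i++) {
        (*refcount)--; /* the real put_page() call would go here */
        done++;
    }
    q->n = 0;
    return done;
}
```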
24
Removing Bottleneck Dependences
We introduce three techniques:
Delay operations until non-speculative: mutex and lock acquire and release; buffer pool, memory, and cursor release; log sequence number assignment
Escape speculation: buffer pool, memory, and cursor allocation
Traditional parallelization: memory allocation, cursor pool, error checks, false sharing
25
Outline
Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions
26
Experimental Setup
Detailed simulation: superscalar, out-of-order, 128-entry reorder buffer; memory hierarchy modeled in detail
TPC-C transactions on BerkeleyDB: in-core database, single user, single warehouse
Measure interval of 100 transactions; measuring latency, not throughput
[Diagram: four CPUs, each with a 32KB 4-way L1 cache, sharing a 2MB 4-way L2 cache and the rest of the memory system]
27
Optimizing the DBMS: New Order
[Bar chart: normalized time as optimizations are applied in sequence: Sequential, No Optimizations, Latches, Locks, Malloc/Free, Buffer Pool, Cursor Queue, Error Checks, False Sharing, B-Tree, Logging; each bar broken into Idle CPU, Violated, Cache Miss, and Busy time]
Cache misses increase
Other CPUs not helping
Can't optimize much more
26% improvement
28
Optimizing the DBMS: New Order
[Bar chart: same normalized-time breakdown across the optimization sequence]
This process took me 30 days and <1200 lines of code.
29
Other TPC-C Transactions
[Bar chart: normalized time for New Order, Delivery, Stock Level, Payment, and Order Status; each bar broken into Idle CPU, Failed, Cache Miss, and Busy time]
3 of 5 transactions speed up by 46-66%
30
Conclusions
TLS makes intra-transaction parallelism practical
Reasonable changes to transaction, DBMS, and hardware
Halve transaction latency
31
Needed backup slides (not done yet)
2 proc. results
Shared caches may change how you want to extract parallelism!
Just have lots of transactions: no sharing
TLS may have more sharing
Any questions?
For more information, see:www.colohan.com
Backup Slides Follow…
LATCHES
35
Latches
Mutual exclusion between transactions
Cause violations between epochs: read-test-write cycle (RAW)
Not needed between epochs: TLS already provides mutual exclusion!
36
Latches: Aggressive Acquire
Acquire; latch_cnt++ … work … latch_cnt--
Homefree
latch_cnt++ … work … (enqueue release)
Commit work; latch_cnt--
Homefree
latch_cnt++ … work … (enqueue release)
Commit work; latch_cnt--; Release
Large critical section
37
Latches: Lazy Acquire
Acquire … work … Release
Homefree
(enqueue acquire) … work … (enqueue release)
Acquire; Commit work; Release
Homefree
(enqueue acquire) … work … (enqueue release)
Acquire; Commit work; Release
Small critical sections
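The lazy-acquire scheme can be sketched as a small model (illustrative, not real latch code): while speculative, the epoch takes no latch at all, since TLS already orders conflicting accesses; it only logs the acquire and release, and at commit the log is replayed so the latch is held just for the commit work.

```c
enum latch_op { ACQUIRE, RELEASE };

typedef struct {
    enum latch_op ops[8];
    int n;
} latch_log_t;

static void lazy_acquire(latch_log_t *l) { l->ops[l->n++] = ACQUIRE; }
static void lazy_release(latch_log_t *l) { l->ops[l->n++] = RELEASE; }

/* Replay at commit; returns the maximum latch hold depth, which is all
 * the "critical section" that remains. */
static int replay_max_held(const latch_log_t *l) {
    int held = 0, max_held = 0;
    for (int i = 0; i < l->n; i++) {
        held += (l->ops[i] == ACQUIRE) ? 1 : -1;
        if (held > max_held) max_held = held;
    }
    return max_held;
}
```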
HARDWARE
39
TLS in Database Systems
Non-DatabaseTLS
Time
TLS in DatabaseSystems
Large epochs:
• More dependences → must tolerate
• More state → bigger buffers
Concurrenttransactions
40
Feedback Loop I know this is
parallel!
for() { do_work();}
par_for() { do_work();}
Must…Make…Faster
think
feed back
41
Violations == Feedback
*p=
*q=
=*p
=*q
Sequential
Time
*p=
*q=
=*p
R2
Violation!
=*p
=*q
Parallel
0x0FD8 0xFD20 0x0FC0 0xFC18
Must…Make…Faster
42
Eliminating Violations
*p=
*q=
=*p
R2
Violation!
=*p
=*q
Parallel
*q==*q
=*q
Violation!
Eliminate *p Dep.
Time
0x0FD8 0xFD20 0x0FC0 0xFC18
Optimization may make it slower?
43
Tolerating Violations: Sub-epochs
Time
*q=Violation!
Sub-epochs
=*q
=*q
*q==*q
=*q
Violation!
Eliminate *p Dep.
44
Sub-epochs
Started periodically by hardware: how many? when to start?
Hardware implementation: just like epochs; use more epoch contexts
No need to check violations between sub-epochs within an epoch
[Diagram: a violated epoch restarts from its most recent sub-epoch, not from the start]
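The payoff of sub-epochs is easy to quantify in a sketch: with a checkpoint every `interval` instructions, a violation at instruction `viol` rewinds only to the most recent checkpoint instead of to instruction 0, so just `viol % interval` instructions of work are lost.

```c
/* Nearest checkpoint at or before the violating instruction. */
static int restart_point(int viol, int interval) {
    return (viol / interval) * interval;
}

/* Instructions of work discarded by the rewind. */
static int work_lost(int viol, int interval) {
    return viol - restart_point(viol, interval);
}
```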
45
Old TLS Design
[Diagram: four CPUs, each with a private L1 cache, sharing an L2 cache and the rest of the memory system]
Buffer speculative state in write-back L1 cache
Detect violations through invalidations
Rest of system only sees committed data
Restart by invalidating speculative lines
Problems:
• L1 cache not large enough
• Later epochs only get values on commit
46
New Cache Design
[Diagram: four CPUs with private L1 caches over a shared L2 cache; invalidations exchanged with other L2 caches]
Buffer speculative and non-speculative state for all epochs in L2
Speculative writes immediately visible to L2 (and later epochs)
Detect violations at lookup time
Invalidation coherence between L2 caches
Restart by invalidating speculative lines
47
New Features
[Diagram: four CPUs with private L1 caches over a shared L2 cache]
Speculative state in L1 and L2 cache
Cache line replication (versions)
Data dependence tracking within cache
Speculative victim cache
48
Scaling
[Two bar charts: normalized time for Sequential, 2, 4, and 8 CPUs; left: original benchmark, right: modified with 50-150 items/transaction; bars broken into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, and Instruction Execution]
49
Evaluating a 4-CPU system
[Bar chart: normalized time for Sequential (original benchmark run on 1 CPU), TLS Seq (parallelized benchmark run on 1 CPU), No Sub-epoch (without sub-epoch support), Baseline (parallel execution), and No Speculation (ignore violations: Amdahl's Law limit); bars broken into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, and Instruction Execution]
50
Sub-epochs: How many/How big?
[Chart: normalized time vs. number of sub-epochs / instructions per sub-epoch]
• Supporting more sub-epochs is better
• Spacing depends on location of violations
• Even spacing is good enough
51
Query Execution
Actions taken by a query:
Bring pages into buffer pool
Acquire and release latches & locks
Allocate/free memory
Allocate/free and use cursors
Use B-trees
Generate log entries
These generate violations.
52
Applying TLS
1. Parallelize loop
2. Run benchmark
3. Remove bottleneck
4. Go to 2
53
Outline
Hardware Developer
TransactionProgrammer
DBMS Programmer
54
Violation Prediction
*q=
=*q
Done
Predict Dependences
Predictor
*q==*q
=*q
Violation!
Eliminate R1/W2
Time
55
Violation Prediction
*q=
=*q
Done
Predict Dependences
Predictor
Time
Predictor problems:
Large epochs → many predictions
Failed prediction → violation
Incorrect prediction → large stall
Two predictors required:
Last store Dependent load
56
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Diagram: four CPUs, each with a private L1 cache, sharing an L2 cache and the rest of the memory system]
57
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Table: L2 cache line state for *p, *s, *t: valid bits plus per-CPU SM (speculatively modified) and SL (speculatively loaded) bits for CPUs 1-4]
58
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Table: L2 cache line state with per-CPU SM/SL bits]
59
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Table: L2 cache line state with per-CPU SM/SL bits]
60
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Table: L2 cache line state for *p and *q with per-CPU SM/SL bits]
61
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Table: L2 cache line state for *p and *q with per-CPU SM/SL bits]
62
Replication
*p=
*q=
=*p
R2
Violation!
=*p
=*q
*q=
Can't invalidate a line if it contains two epochs' changes
[Table: replicated cache line state with per-CPU SM/SL bits]
63
Replication
*p=
*q=
=*p
R2
Violation!
=*p
=*q
*q=
[Table: replicated cache line state with per-CPU SM/SL bits]
64
Replication
*p=
*q=
=*p
R2
Violation!
=*p
=*q
*q=
Makes epochs independent
Enables sub-epochs
[Table: replicated cache line state with per-CPU SM/SL bits]
65
Sub-epochs
*p= *q= =*p =*q
[Table: sub-epoch contexts 1a-1d with per-context SM/SL bits for *p and *q]
Uses more epoch contexts
Detection/buffering/rewind is "free"
More replication: speculative victim cache
66
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
  static tls_mutex mut;
  page_t *ret;

  tls_escape_speculation();
  check_get_arguments(id);
  tls_acquire_mutex(&mut);
  ret = get_page(id);
  tls_release_mutex(&mut);
  tls_on_violation(put, ret);
  tls_resume_speculation();
  return ret;
}
Wraps get_page()
67
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret;
tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation()
return ret;}
No violations while calling get_page()
68
May get bad input data from speculative thread!
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret;
tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation()
return ret;}
69
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret;
tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation()
return ret;}
Only one epoch per transaction at a time
70
How to undo get_page()
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret;
tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation()
return ret;}
71
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret;
tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation()
return ret;}
Isolated: undoing this operation does not cause cascading aborts
Undoable: easy way to return system to initial state
Can also be used for: cursor management, malloc()
72
TPC-C Benchmark
Company
Warehouse 1 Warehouse W
District 1 District 2 District 10
Cust 1 Cust 2 Cust 3k
73
TPC-C Benchmark
[Schema diagram: Warehouse (W) → District (W*10) → Customer (W*30k) → Order (W*30k+) → Order Line (W*300k+), plus New Order (W*9k+), History (W*30k+), Stock (W*100k), and Item (100k), with relationship cardinalities 10, 3k, 1+, 1+, 0-1, 5-15, 3+W, 100k]
74
What is TLS?
while(cond) {
x = hash[i];
...
hash[j] = y;
...
}
Time
= hash[3];
...
hash[10] =
...
= hash[19];
...
hash[21] =
...
= hash[33];
...
hash[30] =
...
= hash[10];
...
hash[25] =
...
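The dependence in the hash example above can be stated as a predicate (illustrative): a violation occurs exactly when a later epoch has already read an element that an earlier epoch then writes, as with hash[10] in the slides.

```c
/* Does an earlier epoch's write to hash[earlier_write_idx] violate a
 * later epoch that read hash[later_read_idx]? */
static int causes_violation(int earlier_write_idx, int later_read_idx) {
    return earlier_write_idx == later_read_idx;
}
```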
75
What is TLS?
while(cond) {
x = hash[i];
...
hash[j] = y;
...
}
Time
= hash[3];
...
hash[10] =
...
= hash[19];
...
hash[21] =
...
= hash[33];
...
hash[30] =
...
= hash[10];
...
hash[25] =
...
Processor A Processor B Processor C Processor D
Thread 1 Thread 2 Thread 3 Thread 4
76
What is TLS?
while(cond) {
x = hash[i];
...
hash[j] = y;
...
}
Time
= hash[3];
...
hash[10] =
...
= hash[19];
...
hash[21] =
...
= hash[33];
...
hash[30] =
...
= hash[10];
...
hash[25] =
...
Processor A Processor B Processor C Processor D
Thread 1 Thread 2 Thread 3 Thread 4
Violation!
77
What is TLS?
while(cond) {
x = hash[i];
...
hash[j] = y;
...
}
Time
= hash[3];
...
hash[10] =
...attempt_commit()
= hash[19];
...
hash[21] =
...attempt_commit()
= hash[33];
...
hash[30] =
...attempt_commit()
Processor A Processor B Processor C Processor D
Thread 1 Thread 2 Thread 3 Thread 4
Violation!
= hash[10];
...
hash[25] =
...attempt_commit()
78
What is TLS?
while(cond) {
x = hash[i];
...
hash[j] = y;
...
}
Time
= hash[3];
...
hash[10] =
...attempt_commit()
= hash[19];
...
hash[21] =
...attempt_commit()
= hash[33];
...
hash[30] =
...attempt_commit()
= hash[10];
...
hash[25] =
...attempt_commit()
Processor A Processor B Processor C Processor D
Thread 1 Thread 2 Thread 3 Thread 4
Violation!
Redo
= hash[10];
...
hash[25] =
...attempt_commit()
Thread 4
79
TLS Hardware Design
What's new? Large threads; epochs will communicate; complex control flow; huge legacy code base
How does hardware change? Store state in L2 instead of L1; reversible atomic operations; tolerate dependences
Aggressive update propagation (implicit forwarding); sub-epochs
80
L1 Cache Line
SL bit: "L2 cache knows this line has been speculatively loaded"; on violation or commit: clear
SM bit: "This line contains speculative changes"; on commit: clear; on violation: SM → Invalid
Otherwise, just like a normal cache
[Cache line fields: SL | SM | Valid | LRU | Tag | Data]
81
Escaping Speculation
Speculative epoch wants to make a visible change to the system!
Ignore SM lines while escaped
Stale bit: "This line may be outdated by speculative work"; on violation or commit: clear
[Cache line fields: SL | SM | Valid | LRU | Tag | Data | Stale]
82
L1 to L2 communication
L2 sees all stores (write-through)
L2 sees first load of an epoch: NotifySL message
L2 can track data dependences!
83
L1 Changes Summary
Add three bits to each line: SL, SM, Stale
Modify tag match to recognize bits
Add queue of NotifySL requests
[Cache line fields: SL | SM | Valid | LRU | Tag | Data | Stale]
84
L2 Cache Line
[Cache line fields: per-CPU SL/SM bits (CPU1, CPU2), Fine-Grained SM, Valid, Dirty, Exclusive, LRU, Tag, Data]
Cache line can be: Modified by one CPU Loaded by multiple CPUs
85
Cache Line Conflicts
Three classes of conflict:
Epoch 2 stores, epoch 1 loads: need "old" version to load
Epoch 1 stores, epoch 2 stores: need to keep changes separate
Epoch 1 loads, epoch 2 stores: need to be able to discard line on violation
Need a way of storing multiple conflicting versions in the cache
86
Cache line replication
On conflict, replicate line: split line into two copies; divide SM and SL bits at split point; divide directory bits at split point
87
Replication Problems
Complicates line lookup: need to find all replicas and select "best" (best == most recent replica)
Change management: on write, update all later copies; also need to find all more speculative replicas to check for violations
On commit must get rid of stale lines: Invalidation Required Buffer (IRB)
88
Victim Cache
How do you deal with a full cache set? Use a victim cache
Holds evicted lines without losing SM & SL bits
Must be fast; every cache lookup needs to know:
Do I have the "best" replica of this line? (critical path)
Do I cause a violation? (not on critical path)
89
Summary of Hardware Support
Sub-epochs: violations hurt less!
Shared cache TLS support: faster communication; more room to store state
RAOs: don't speculate on known operations; reduces amount of speculative state
90
Summary of Hardware Changes
Sub-epochs: checkpoint register state; needs replicas in cache
Shared cache TLS support: speculative L1; replication in L1 and L2; speculative victim cache; Invalidation Required Buffer
RAOs: suspend/resume speculation; mutexes; "undo list"
91
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Diagram: four CPUs with private L1 caches over a shared L2 cache; speculative versions of *p and *q propagate, with an invalidation triggering the violation]
92
Problems with Old Cache Design
Database epochs are large: L1 cache not large enough
Sub-epochs add more state: L1 cache not associative enough
Database epochs communicate: L1 cache only communicates committed data
96
Intro Summary
TLS makes intra-transaction parallelism easy
Divide transaction into epochs
Hardware support: detect violations; restart to recover (sub-epochs mitigate penalty); buffer state
New process: modify software to avoid violations and improve performance
97
[Cartoon: a crowd chanting "Money!" while running while(too_slow) make_faster();]
The Many Faces of Ogg
Must tune reorder buffer…
98
Must tune reorder buffer…
The Many Faces of Ogg
[Cartoon: most of the crowd mutters "Duh…" while running while(too_slow) make_faster();]
99
Removing Bottlenecks
Three general techniques:
Partition data structures: malloc
Postpone operations until non-speculative: latches and locks, log entries
Handle speculation manually: buffer pool
100
Bottlenecks Encountered
Buffer pool; latches & locks; malloc/free; cursor queues; error checks; false sharing; B-tree performance optimization; log entries
101
Must tune reorder buffer…
The Many Faces of Ogg
[Cartoon: most of the crowd mutters "Duh…" while running while(too_slow) make_faster();]
102
Performance on 4 CPUs
[Charts: unmodified benchmark vs. modified benchmark]
103
Incremental Parallelization
Time
4 CPUs
104
Scaling
[Charts: unmodified benchmark vs. modified benchmark]
105
Parallelization is Hard
[Chart: Performance Improvement vs. Programmer Effort; curves for Hand Parallelization, Parallelizing Compiler, and "What we want", with repeated Tuning steps]
106
Case Study: New Order (TPC-C)
Begin transaction {
} End transaction
107
Case Study: New Order (TPC-C)
Begin transaction { Read customer info Read & increment order # Create new order
} End transaction
108
Case Study: New Order (TPC-C)
Begin transaction {
  Read customer info
  Read & increment order #
  Create new order
  For each item in order {
    Get item info
    Decrement count in stock
    Record order info
  }
} End transaction
109
Case Study: New Order (TPC-C)
Begin transaction { Read customer info Read & increment order # Create new order For each item in order { Get item info Decrement count in stock Record order info }} End transaction
80% of transaction execution time
110
Case Study: New Order (TPC-C)
Begin transaction { Read customer info Read & increment order # Create new order For each item in order { Get item info Decrement count in stock Record order info }} End transaction
80% of transaction execution time
111
The Many Faces of Ogg
[Cartoon: most of the crowd mutters "Duh…" while running while(too_slow) make_faster();]
Must tune reorder buffer…
Step 2: Changing the Software
113
No problem!
Loop is easy to parallelize using TLS! Not really:
Calls into DBMS invoke complex operations
Ogg needs to do some work
114
Resource Management
Mutexes acquired and released
Locks locked and unlocked
Cursors pushed and popped from free stack
Memory allocated and freed
Buffer pool entries Acquired and released
115
Mutexes: Deadlock?
Problem: re-ordered acquire/release operations may have introduced deadlock
Solutions:
Avoidance: static acquire order
Recovery: detect deadlock and violate
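The static-acquire-order avoidance rule can be sketched as follows (illustrative; the real system might order by lock id rather than address): if every thread acquires mutexes in one global order, no cycle of waiters can form, so the re-ordered acquires cannot deadlock.

```c
#include <stdint.h>

/* Are two locks in the globally agreed acquisition order (by address)? */
static int in_acquire_order(const void *first, const void *second) {
    return (uintptr_t)first < (uintptr_t)second;
}

/* Swap a pair of locks into the safe global acquisition order. */
static void order_pair(const void **a, const void **b) {
    if (!in_acquire_order(*a, *b)) {
        const void *t = *a;
        *a = *b;
        *b = t;
    }
}
```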
116
Locks
Like mutexes, but: allows multiple readers; no memory overhead when not held; often held for much longer
Treat similarly to mutexes
117
Cursors
Used for traversing B-trees Pre-allocated, kept in pools
118
Maintaining Cursor Pool
[Diagram: cursors cycle through Get/Use/Release via a shared pool head]
121
Maintaining Cursor Pool
[Diagram: two CPUs Get/Use/Release cursors through one shared pool head → Violation!]
122
Parallelizing Cursor Pool
Use per-CPU pools
Modify code: each CPU gets its own pool
No sharing == no violations!
Requires cpuid() instruction
[Diagram: two per-CPU cursor pools, each with its own head and Get/Use/Release cycle]
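The per-CPU cursor pool can be sketched like this (illustrative: cursors are just ints, and a cpuid()-style index is assumed to be passed in): each CPU pushes and pops its own free stack, so epochs on different CPUs never touch the same pool head and generate no violations.

```c
#define POOL_NCPUS 4
#define POOL_SZ 8

typedef struct {
    int items[POOL_SZ];
    int top;
} cursor_pool_t;

static cursor_pool_t pools[POOL_NCPUS]; /* one pool head per CPU */

static void pool_put(int cpu, int cursor) {
    cursor_pool_t *p = &pools[cpu];
    if (p->top < POOL_SZ)
        p->items[p->top++] = cursor;
}

static int pool_get(int cpu) {
    cursor_pool_t *p = &pools[cpu];
    return p->top > 0 ? p->items[--p->top] : -1; /* -1: pool empty */
}
```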
123
Memory Allocation
Problem: malloc() metadata causes dependences
Solutions: per-CPU memory pools; parallelized free list
124
The Log
Append records to global log; appending causes dependence
Can't parallelize: global log sequence number (LSN)
Generate log records in buffers; assign LSNs when homefree
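The deferred-LSN idea can be sketched as follows (illustrative structure, not the real log format): a speculative epoch appends log records to a private buffer with no LSN; once the epoch is homefree, it stamps them, in order, from the global sequence number.

```c
#define MAX_RECS 16

typedef struct {
    int lsn;      /* -1 until the epoch is homefree */
    char payload;
} log_rec_t;

typedef struct {
    log_rec_t recs[MAX_RECS];
    int n;
} log_buf_t;

static void log_append(log_buf_t *b, char payload) {
    if (b->n < MAX_RECS) {
        b->recs[b->n].lsn = -1; /* unknown until this epoch commits */
        b->recs[b->n].payload = payload;
        b->n++;
    }
}

/* Called when the epoch is oldest: consume LSNs from the global counter. */
static void assign_lsns(log_buf_t *b, int *global_lsn) {
    for (int i = 0; i < b->n; i++)
        b->recs[i].lsn = (*global_lsn)++;
}
```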
125
B-Trees
Leaf pages contain free-space counts
Inserts of random records: o.k.
Inserting adjacent records: dependence on decrementing count
Page splits: infrequent
126
Other Dependences
Statistics gathering Error checks False sharing
127
Related Work
Lots of work in TLS: Multiscalar (Wisconsin), Hydra (Stanford), IACOMA (Illinois), RAW (MIT)
Hand parallelizing using TLS: Manohar Prabhu and Kunle Olukotun (PPoPP'03)
Any questions?
129
Why is this a problem?
B-tree insertion into ORDERLINE table; key is ol_n
DBMS does not know that keys will be sequential
Each insert usually updates the same B-tree page
for(ol_n=0; ol_n<15; ol_n++) {
  INSERT into ORDERLINE (ol_n, ol_item, ol_cnt)
    VALUES (:ol_n, :ol_item, :ol_cnt);
}
130
Sequential Btree Inserts
[Diagram: B-tree leaf pages during sequential inserts; each page's free-space count (4, 3, 2, 1…) decrements as items fill the free slots]
132
Outline
Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences
Aggressive update propagation (implicit forwarding)
Sub-epochs Results and analysis
133
Outline
Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences
Aggressive update propagation (implicit forwarding)
Sub-epochs Results and analysis
134
Tolerating dependences
Aggressive update propagation: get for free!
Sub-epochs: periodically checkpoint epochs; every N instructions?
Picking N may be interesting; perhaps checkpoints could be set before the location of previous violations?
135
Outline
Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences
Aggressive update propagation (implicit forwarding)
Sub-epochs Results and analysis
136
Why not faster?
Possible reasons: idle CPUs, RAO mutexes, violations, cache effects, data dependences
137
Why not faster?
Possible reasons:
Idle CPUs: 9 epochs/region average; two bundles of four and one of one; ¼ of CPU cycles wasted!
RAO mutexes, violations, cache effects, data dependences
138
Why not faster?
Possible reasons:
Idle CPUs
RAO mutexes: not implemented yet. Oops!
Violations, cache effects, data dependences
139
Why not faster?
Possible reasons:
Idle CPUs, RAO mutexes
Violations: 21/969 epochs violated; distance 1 "magic synchronized"; 2.2 Mcycles (over 4 CPUs), about 1.5%
Cache effects, data dependences
140
Why not faster?
Possible reasons:
Idle CPUs, RAO mutexes, violations
Cache effects: deserves its own slide
Data dependences
141
Cache effects of speculation
Only 20% of references are speculative!
Speculative references have small impact on non-speculative hit rate (<1%)
Speculative refs miss a lot in L1: 9-15% for reads, 2-6% for writes
L2 saw HUGE increase in traffic: 152k refs to 3474k refs
Spec/non-spec lines are thrashing from L1s
142
Why not faster?
Possible reasons:
Idle CPUs, RAO mutexes, violations, cache effects
Data dependences: oh yeah! B-tree item count
Split up B-tree insert? alloc and write; do alloc as RAO; needs more thought
143
L2 Cache Line
SL
SM
Fin
e G
rain
ed
SM
Valid
Dir
ty
Exclu
siv
e
LR
U
Tag
Data
SL
SM
CPU1 CPU2
Set
1S
et
2
144
Why are you here?
Want faster database systems
Have funky new hardware: Thread Level Speculation (TLS)
How can we apply TLS to database systems?
Side question: is this a VLDB or an ASPLOS talk?
145
How?
Divide transaction into TLS-threads
Run TLS-threads in parallel; maintain sequential semantics
Profit!
146
Why parallelize transactions?
Decrease transaction latency
Increase concurrency while avoiding the concurrency-control bottleneck (a.k.a. use more CPUs, same number of transactions)
The obvious: Database performance matters
147
Shopping List
What do we need? (research scope)
Cheap hardware: Thread Level Speculation (TLS). Minor changes allowed.
Important database application: TPC-C. Almost no changes allowed!
Modular database system: BerkeleyDB. Some changes allowed.
148
Outline
TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
149
Outline
TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
150
What’s new?
Database operations are large and complex
Large TLS-threads: lots of dependences, difficult to analyze
Want: programmer optimization effort = faster program
151
Hardware changes summary
Must tolerate dependences: prediction? implicit forwarding?
May need larger caches
May need larger associativity
152
Outline
TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
153
Parallelization Strategy
1. Pick a benchmark
2. Parallelize a loop
3. Analyze dependences
4. Optimize away dependences
5. Evaluate performance
6. If not satisfied, goto 3
154
Outline
TLS Hardware The Benchmark (TPC-C) Changing the database system
Resource management The log B-trees False sharing
Results Conclusions
155
Outline
TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
156
Results
Viola simulator: single CPI, perfect violation prediction, no memory system, 4 CPUs, exhaustive dependence tracking
Currently working on an out-of-order superscalar simulation (cello)
10 transaction warm-up; measure 100 transactions
157
Outline
TLS Hardware
The Benchmark (TPC-C)
Changing the database system
Results
Conclusions
158
Conclusions
TLS can improve transaction latency
Violation predictors are important (iff dependences must be tolerated)
TLS makes hand-parallelizing easier
159
Improving Database Performance
How to improve performance:
  Parallelize the transaction
  Increase the number of concurrent transactions
Both of these require independence of database operations!
160
Case Study: New Order (TPC-C)
Begin transaction {
} End transaction
161
Case Study: New Order (TPC-C)
Begin transaction {
  Read customer info (customer, warehouse)
  Read & increment order # (district)
  Create new order (orders, neworder)
} End transaction
162
Case Study: New Order (TPC-C)
Begin transaction {
  Read customer info (customer, warehouse)
  Read & increment order # (district)
  Create new order (orders, neworder)
  For each item in order {
    Get item info (item)
    Decrement count in stock (stock)
    Record order info (orderline)
  }
} End transaction
163
Case Study: New Order (TPC-C)
Begin transaction {
  Read customer info (customer, warehouse)
  Read & increment order # (district)
  Create new order (orders, neworder)
  For each item in order {
    Get item info (item)
    Decrement count in stock (stock)
    Record order info (orderline)
  }
} End transaction
Parallelizethis loop
164
Implementing on a Real DB
Using BerkeleyDB
  "Table" == "Database"
  Give the database an arbitrary key; it returns arbitrary data (bytes)
  Use structs for keys and rows
Database provides ACID through:
  Transactions
  Locking (page level)
  Storage management
Provides indexing using B-trees
166
Parallelizing a Transaction
For each item in order {
  Get item info (item)
  Decrement count in stock (stock)
  Record order info (order line)
}
167
Parallelizing a Transaction
For each item in order {
  Get item info (item)
  Decrement count in stock (stock)
  Record order info (order line)
}
Each table access:
•Get cursor from pool
•Use cursor to traverse B-tree
•Find row, lock page for row
•Release cursor to pool
168
Maintaining Cursor Pool
[Diagram: the cursor pool is a linked list; head points to the free cursors. Get pops a cursor off head, Use it, Release pushes it back.]
169
Maintaining Cursor Pool
[Diagram: two epochs each Get, Use, and Release a cursor from the same pool; both update head: Violation!]
172
Parallelizing Cursor Pool 1
Use per-CPU pools: modify the code so each CPU gets its own pool
No sharing == no violations!
Requires a cpuid() instruction
[Diagram: two independent pools, one per CPU, each with its own head]
173
Parallelizing Cursor Pool 2
Dequeue and enqueue: atomic and unordered
Delay enqueue until end of thread
  Forces separate pools
  Avoids modification of the data struct
[Diagram: each epoch dequeues from the shared pool but holds its releases until it ends]
174
Parallelizing Cursor Pool 3
Atomic unordered dequeue & enqueue
Cursor struct is “TLS unordered”
Struct defined as a byte range in memory
[Diagram: multiple epochs Get/Use/Release cursors concurrently through the unordered pool]
175
Parallelizing Cursor Pool 4
Mutex-protect dequeue & enqueue; declare the pointer to the cursor struct to be "TLS unordered"
Any access through the pointer does not have TLS applied
The pointer is tainted; any copies of it keep this property
176
Problems with 3 & 4
What exactly is the boundary of a structure?
How do you express the concept of an object in a loosely-typed language like C?
A byte range or a pointer is only an approximation.
What about dynamically allocated sub-components?
177
Mutexes in a TLS world
Two types of threads: "real" threads, TLS threads
Two types of mutexes: inter-real-thread, inter-TLS-thread
178
Inter-real-thread Mutexes
Acquire == get the mutex for all TLS threads
Release == release for the current TLS thread
  May still be held by another TLS thread!
179
Inter-TLS-thread Mutexes
Should never interact between two real threads
Implies no TLS ordering between TLS threads while the mutex is held
But what to do on a violation?
  Can't just throw away changes to memory
  Must undo operations performed in the critical section
180
Parallelizing Databases using TLS
Split transactions into threads
Threads created are large:
  60k+ instructions
  16 kB of speculative state
  More dependences between threads
How do we design a machine that can handle these large threads?
181
The “Old” Way
[Diagram: two chip multiprocessors, each with four CPUs and private L1 caches holding per-epoch speculative state; each chip's L2 holds committed state, backed by an L3 and the memory system]
182
The “Old” Way
Advantages:
  Each epoch has its own L1 cache
  Epoch state does not intermix
Disadvantages:
  L1 cache is too small!
  Full cache == dead meat
  No shared speculative memory
183
The “New” Way
L2 cache is huge!
State of the art in caches (Power5):
  1.92 MB 10-way L2
  32 kB 4-way L1
Shared speculative memory "for free"
Keeps TLS logic off of the critical path
184
TLS Shared L2 Design
L1: write-through, write-no-allocate [CullerSinghGupta99]
  Easy to understand and reason about
  Writes visible to L2, which simplifies shared speculative memory
L2 cache: shared cache architecture with replication
Rest of memory: distributed TLS coherence
185
TLS Shared L2 Design
[Diagram: two chip multiprocessors, each with four CPUs; the private L1 caches hold cached speculative state, while each chip's shared L2 holds the "real" speculative state, backed by an L3 and the memory system]
Explain from the top down
186
TLS Shared L2 Design
Part II: Dealing with dependences
188
Predictor Design
How do you design a predictor that:
  Identifies violating loads
  Identifies the last store that causes them
  Only triggers when they cause a problem
  Has very, very high accuracy
???
189
Sub-epoch design
Like checkpointing
Leave "holes" in epoch # space
Every 5k instructions, start a new epoch
Uses more cache to buffer changes
  More strain on associativity / victim cache
Uses more epoch contexts
190
Summary
Supporting large epochs needs:
  Buffer state in L2 instead of L1
  Shared speculative memory
  Replication
  Victim cache
  Sub-epochs
Any questions?