Optimistic Intra-Transaction Parallelism using Thread Level Speculation. Chris Colohan1, Anastassia Ailamaki1, J. Gregory Steffan2 and Todd C. Mowry1,3. 1 Carnegie Mellon University, 2 University of Toronto, 3 Intel Research Pittsburgh


Page 1: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

Optimistic Intra-Transaction Parallelism using Thread Level Speculation

Chris Colohan1, Anastassia Ailamaki1, J. Gregory Steffan2 and Todd C. Mowry1,3

1 Carnegie Mellon University, 2 University of Toronto, 3 Intel Research Pittsburgh

Page 2: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

2

Chip Multiprocessors are Here!

2 cores now; soon 4, 8, 16, or 32. Multiple threads per core. How do we best use them?

IBM Power 5

AMD Opteron

Intel Yonah

Page 3: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

3

Multi-Core Enhances Throughput

Database Server: Users → Transactions → DBMS → Database

Cores can run concurrent transactions and improve throughput.

Page 4: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

4

Using Multiple Cores

Database Server: Users → Transactions → DBMS → Database

Can multiple cores improve transaction latency?

Page 5: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

5

Parallelizing transactions

SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}

DBMS

Intra-query parallelism: used for long-running queries (decision support); does not work for short queries.

Short queries dominate in commercial workloads.

Page 6: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

6

Parallelizing transactions

SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}

DBMS

Intra-transaction parallelism: each thread spans multiple queries.

Hard to add to existing systems! Need to change the interface, add latches and locks, and worry about the correctness of parallel execution…

Page 7: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

7

Parallelizing transactions

SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}

DBMS

Intra-transaction parallelism: breaks the transaction into threads.

Hard to add to existing systems! Need to change the interface, add latches and locks, and worry about the correctness of parallel execution…

Thread Level Speculation (TLS) makes parallelization easier.

Page 8: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

8

Thread Level Speculation (TLS)

[Figure: sequential vs. parallel execution of the accesses *p=, *q=, =*p, =*q, split across Epoch 1 and Epoch 2; time flows downward.]

Page 9: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

9

Thread Level Speculation (TLS)

[Figure: in the parallel execution, Epoch 2 reads *p before Epoch 1's write; the conflicting read (R2) triggers a violation and Epoch 2 restarts.]

Use epochs. Detect violations. Restart to recover. Buffer state.

Oldest epoch: never restarts, no buffering.

Worst case: sequential. Best case: fully parallel. Data dependences limit performance.

Page 10: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

10

A Coordinated Effort

Transaction Programmer: choose epoch boundaries.
DBMS Programmer: remove performance bottlenecks.
Hardware Developer: add TLS support to the architecture.

Page 11: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

11

So what’s new?

Intra-transaction parallelism: without changing the transactions, with minor changes to the DBMS, without having to worry about locking, without introducing concurrency bugs, and with good performance.

Halves transaction latency on four cores.

Page 12: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

12

Related Work

Optimistic Concurrency Control (Kung, 1982)

Sagas (Garcia-Molina & Salem, 1987)

Transaction chopping (Shasha, 1995)

Page 13: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

13

Outline

Introduction, Related work, Dividing transactions into epochs, Removing bottlenecks in the DBMS, Results, Conclusions

Page 14: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

14

Case Study: New Order (TPC-C)

GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}

The only dependence is the quantity field, and a conflict is very unlikely (about 1 in 100,000). The loop is 78% of transaction execution time.

Page 15: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

15

Case Study: New Order (TPC-C)

GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}

becomes:

GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
TLS_foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}

Page 16: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

16

Outline

Introduction, Related work, Dividing transactions into epochs, Removing bottlenecks in the DBMS, Results, Conclusions

Page 17: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

17

Dependences in DBMS

[Figure: dependence arrows between epochs; time flows downward.]

Page 18: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

18

Dependences in DBMS

Dependences serialize execution!

Example: statistics gathering (pages_pinned++). TLS maintains the serial ordering of the increments. To remove the dependence, use per-CPU counters.

Performance tuning: profile execution, remove the bottleneck dependence, repeat.

Page 19: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

19

Buffer Pool Management

[Figure: a CPU calls get_page(5) on the buffer pool (ref count goes to 1), then put_page(5) (ref count returns to 0).]

Page 20: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

20

Buffer Pool Management

[Figure: two epochs each call get_page(5)/put_page(5); TLS serializes their accesses to the page's reference count.]

TLS ensures the first epoch gets the page first. Who cares? TLS maintains the original load/store order, but sometimes this is not needed.

Page 21: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

21

Buffer Pool Management

[Figure: the get_page(5)/put_page(5) calls escape speculation, so the buffer pool sees them non-speculatively.]

Isolated: undoing get_page will not affect other transactions.
Undoable: an operation (put_page) returns the system to its initial state.

The recipe: escape speculation, invoke the operation, store an undo function, resume speculation.

Page 22: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

22

Buffer Pool Management

[Figure: escaping speculation for put_page(5) as well.]

Not undoable!

Page 23: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

23

Buffer Pool Management

[Figure: get_page(5) escapes speculation; the matching put_page(5) is delayed until the end of the epoch.]

Delay put_page until the end of the epoch to avoid the dependence.

Page 24: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

24

Removing Bottleneck Dependences

We introduce three techniques:

Delay operations until non-speculative: mutex and lock acquire/release; buffer pool, memory, and cursor release; log sequence number assignment.

Escape speculation: buffer pool, memory, and cursor allocation.

Traditional parallelization: memory allocation, cursor pool, error checks, false sharing.

Page 25: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

25

Outline

Introduction, Related work, Dividing transactions into epochs, Removing bottlenecks in the DBMS, Results, Conclusions

Page 26: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

26

Experimental Setup

Detailed simulation: superscalar, out-of-order, 128-entry reorder buffer; memory hierarchy modeled in detail.

TPC-C transactions on BerkeleyDB: in-core database, single user, single warehouse. Measure an interval of 100 transactions. Measuring latency, not throughput.

[Figure: four CPUs, each with a 32KB 4-way L1 cache, sharing a 2MB 4-way L2 cache connected to the rest of the memory system.]

Page 27: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

27

Optimizing the DBMS: New Order

[Chart: New Order execution time, normalized to Sequential, as optimizations are applied cumulatively: No Optimizations, Latches, Locks, Malloc/Free, Buffer Pool, Cursor Queue, Error Checks, False Sharing, B-Tree, Logging. Each bar is broken into Idle CPU, Violated, Cache Miss, and Busy.]

Cache misses increase. Other CPUs are not helping. Can't optimize much more. 26% improvement.

Page 28: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

28

Optimizing the DBMS: New Order

[Chart: the same normalized execution-time breakdown as the previous slide.]

This process took me 30 days and <1200 lines of code.

Page 29: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

29

Other TPC-C Transactions

[Chart: normalized execution time for New Order, Delivery, Stock Level, Payment, and Order Status, broken into Idle CPU, Failed, Cache Miss, and Busy.]

3 of 5 transactions speed up by 46-66%.

Page 30: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

30

Conclusions

TLS makes intra-transaction parallelism practical: with reasonable changes to the transaction, DBMS, and hardware, it halves transaction latency.

Page 31: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

31

Needed backup slides (not done yet)

2-processor results. Shared caches may change how you want to extract parallelism! With just lots of transactions there is no sharing; TLS may have more sharing.

Page 32: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

Any questions?

For more information, see: www.colohan.com

Page 33: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

Backup Slides Follow…

Page 34: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

LATCHES

Page 35: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

35

Latches

Latches provide mutual exclusion between transactions but cause violations between epochs: the read-test-write cycle is a RAW dependence. They are not needed between epochs; TLS already provides mutual exclusion!

Page 36: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

36

Latches: Aggressive Acquire

Sequential: Acquire, latch_cnt++ …work… latch_cnt--, Release.

Speculative: once homefree, latch_cnt++ …work… (enqueue release); at commit, commit work and latch_cnt--.

Result: one large critical section.

Page 37: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

37

Latches: Lazy Acquire

Sequential: Acquire …work… Release.

Speculative: (enqueue acquire) …work… (enqueue release); at commit, Acquire, commit work, Release.

Result: small critical sections.

Page 38: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

HARDWARE

Page 39: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

39

TLS in Database Systems

[Figure: non-database TLS epochs are small; TLS epochs in database systems are much larger and run alongside concurrent transactions. Time flows downward.]

Large epochs mean more dependences, which must be tolerated, and more state, which requires bigger buffers.

Page 40: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

40

Feedback Loop

The programmer, thinking "I know this is parallel!", turns for() { do_work(); } into par_for() { do_work(); }, then repeatedly feeds profiling results back into the code: must… make… faster.

Page 41: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

41

Violations == Feedback

[Figure: the sequential and parallel executions of the *p/*q example; the violation report gives the addresses involved (0x0FD8, 0xFD20, 0x0FC0, 0xFC18), telling the programmer exactly which dependence to fix.]

Page 42: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

42

Eliminating Violations

[Figure: eliminating the *p dependence removes one violation, but a *q dependence now violates and the epoch still restarts.]

Optimization may make it slower?

Page 43: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

43

Tolerating Violations: Sub-epochs

[Figure: with sub-epochs, the violation on *q rewinds the epoch only to the nearest sub-epoch boundary instead of to its start.]

Page 44: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

44

Sub-epochs

Started periodically by hardware. How many? When to start?

Hardware implementation: just like epochs, using more epoch contexts. No need to check for violations between sub-epochs within an epoch.

[Figure: sub-epoch boundaries limit the rewind distance on a violation.]

Page 45: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

45

Old TLS Design

[Figure: four CPUs with private L1 caches over a shared L2 cache; invalidations detect conflicts.]

Buffer speculative state in the write-back L1 cache. Detect violations through invalidations. The rest of the system only sees committed data. Restart by invalidating speculative lines.

Problems: the L1 cache is not large enough, and later epochs only get values on commit.

Page 46: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

46

New Cache Design

[Figure: the same four CPUs and shared L2; invalidation coherence runs between L2 caches.]

Buffer speculative and non-speculative state for all epochs in the L2. Speculative writes are immediately visible to the L2 (and to later epochs). Detect violations at lookup time. Restart by invalidating speculative lines.

Page 47: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

47

New Features

[Figure: the same system diagram.]

New: speculative state in the L1 and L2 caches, cache line replication (versions), data dependence tracking within the cache, and a speculative victim cache.

Page 48: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

48

Scaling

[Chart: normalized execution time on Sequential, 2, 4, and 8 CPUs; a second chart shows the benchmark modified to 50-150 items per transaction. Bars are broken into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, and Instruction Execution.]

Page 49: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

49

Evaluating a 4-CPU system

[Chart: normalized execution time for Sequential (original benchmark on 1 CPU), TLS Seq (parallelized benchmark on 1 CPU), No Sub-epoch (without sub-epoch support), Baseline (parallel execution), and No Speculation (violations ignored: the Amdahl's Law limit). Bars are broken into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, and Instruction Execution.]

Page 50: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

50

Sub-epochs: How many/How big?

[Chart: normalized execution time versus the number of sub-epochs and instructions per sub-epoch.]

Supporting more sub-epochs is better. Spacing depends on the location of violations, but even spacing is good enough.

Page 51: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

51

Query Execution

Actions taken by a query: bring pages into the buffer pool; acquire and release latches and locks; allocate/free memory; allocate/free and use cursors; use B-trees; generate log entries.

These generate violations.

Page 52: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

52

Applying TLS

1. Parallelize loop
2. Run benchmark
3. Remove bottleneck
4. Go to 2

Page 53: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

53

Outline

Hardware Developer

TransactionProgrammer

DBMS Programmer

Page 54: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

54

Violation Prediction

[Figure: a dependence predictor watches the *q store/load pair; predicting the dependence eliminates the R1/W2 violation by synchronizing the load with the store.]

Page 55: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

55

Violation Prediction

[Figure: the predictor stalls the dependent load until the predicted last store is done.]

Predictor problems: large epochs mean many predictions; a failed prediction means a violation; an incorrect prediction means a large stall.

Two predictors are required: one for the last store, one for the dependent load.

Page 56: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

56

TLS Execution

[Figure: the *p/*q example mapped onto four CPUs (CPU 1-4), each with a private L1 cache over a shared L2.]

Page 57: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

57

TLS Execution

[Figure: the shared L2's lines (*p, *s, *t) carry per-CPU SM (speculatively modified) and SL (speculatively loaded) bits alongside the valid bit.]

Page 58: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

58

TLS Execution

[Figure: the SM/SL bits update as the epochs issue their loads and stores.]

Page 59: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

59

TLS Execution

[Figure: the example continues with the *p line.]

Page 60: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

60

TLS Execution

[Figure: a *q line is allocated and its bits set.]

Page 61: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

61

TLS Execution

[Figure: the *q access is checked against the other epochs' SL/SM bits.]

Page 62: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

62

Replication

[Figure: one cache line accumulates changes from two epochs.]

Can't invalidate a line if it contains two epochs' changes.

Page 63: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

63

Replication

[Figure: the line is replicated so each epoch's changes live in their own version.]

Page 64: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

64

Replication

[Figure: replicated line versions, one per epoch.]

Replication makes epochs independent and enables sub-epochs.

Page 65: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

65

Sub-epochs

[Figure: one epoch split into sub-epochs 1a-1d, each with its own SM/SL context for the *p and *q lines.]

Uses more epoch contexts. Detection/buffering/rewind is "free". More replication: the speculative victim cache.

Page 66: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

66

get_page() wrapper

page_t *get_page_wrapper(pageid_t id) {
  static tls_mutex mut;
  page_t *ret;

  tls_escape_speculation();
  check_get_arguments(id);
  tls_acquire_mutex(&mut);
  ret = get_page(id);
  tls_release_mutex(&mut);
  tls_on_violation(put, ret);
  tls_resume_speculation();
  return ret;
}

Wraps get_page().

Page 67: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

67

get_page() wrapper


No violations while calling get_page()

Page 68: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

68

May get bad input data from speculative thread!

get_page() wrapper


Page 69: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

69

get_page() wrapper


Only one epoch per transaction at a time

Page 70: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

70

How to undo get_page()

get_page() wrapper


Page 71: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

71

get_page() wrapper


Isolated: undoing this operation does not cause cascading aborts.
Undoable: an easy way to return the system to its initial state.

Can also be used for cursor management and malloc().

Page 72: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

72

TPC-C Benchmark

[Figure: Company → Warehouse 1…W → District 1…10 → Customer 1…3k.]

Page 73: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

73

TPC-C Benchmark

[Figure: TPC-C schema with cardinalities: Warehouse (W), District (W*10), Customer (W*30k), Order (W*30k+), Order Line (W*300k+), New Order (W*9k+), History (W*30k+), Stock (W*100k), Item (100k), and the relationships between them (10, 3k, 1+, 0-1, 5-15, 3+W, 100k).]

Page 74: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

74

What is TLS?

while(cond) {
  x = hash[i];
  ...
  hash[j] = y;
  ...
}

[Figure: successive iterations read and write hash[3], hash[10], hash[19], hash[21], hash[33], hash[30], hash[10], hash[25]; time flows downward.]

Page 75: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

75

What is TLS?

[Figure: the same loop's iterations distributed as Threads 1-4 across Processors A-D.]

Page 76: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

76

What is TLS?

[Figure: Thread 4 reads hash[10] before Thread 1's write of hash[10] is visible: Violation!]

Page 77: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

77

What is TLS?

[Figure: each thread ends with attempt_commit(); Thread 4's speculation fails on the hash[10] violation.]

Page 78: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

78

What is TLS?

[Figure: after the violation, Thread 4 redoes its work (= hash[10] … hash[25] =) and commits.]

Page 79: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

79

TLS Hardware Design

What's new? Large threads; epochs will communicate; complex control flow; a huge legacy code base.

How does hardware change? Store state in the L2 instead of the L1; reversible atomic operations; tolerate dependences through aggressive update propagation (implicit forwarding) and sub-epochs.

Page 80: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

80

L1 Cache Line

SL bit: "the L2 cache knows this line has been speculatively loaded". Cleared on violation or commit.

SM bit: "this line contains speculative changes". Cleared on commit; on violation, SM → Invalid.

Otherwise, just like a normal cache.

[Line format: SL | SM | Valid | LRU | Tag | Data]

Page 81: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

81

Escaping Speculation

A speculative epoch wants to make a system-visible change! SM lines are ignored while escaped.

Stale bit: "this line may be outdated by speculative work." Cleared on violation or commit.

[Line format: SL | SM | Valid | LRU | Tag | Data | Stale]

Page 82: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

82

L1 to L2 communication

The L2 sees all stores (write-through). The L2 sees the first load of an epoch via a NotifySL message. So the L2 can track data dependences!

Page 83: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

83

L1 Changes Summary

Add three bits to each line: SL, SM, and Stale. Modify the tag match to recognize the bits. Add a queue of NotifySL requests.

[Line format: SL | SM | Valid | LRU | Tag | Data | Stale]

Page 84: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

84

L2 Cache Line

[Line format: per-CPU SL and SM bits (CPU1, CPU2, …) | Fine-Grained SM | Valid | Dirty | Exclusive | LRU | Tag | Data]

A cache line can be modified by one CPU and loaded by multiple CPUs.

Page 85: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

85

Cache Line Conflicts

Three classes of conflict: epoch 2 stores while epoch 1 loads (need the "old" version to load); epoch 1 stores while epoch 2 stores (need to keep the changes separate); epoch 1 loads while epoch 2 stores (need to be able to discard the line on violation).

We need a way of storing multiple conflicting versions in the cache.

Page 86: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

86

Cache line replication

On conflict, replicate the line: split it into two copies, dividing the SM and SL bits and the directory bits at the split point.

Page 87: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

87

Replication Problems

Replication complicates line lookup: we must find all replicas and select the "best" one, i.e. the most recent replica.

Change management: on a write, update all later copies and check all more-speculative replicas for violations; on commit, stale lines must be discarded via the Invalidation Required Buffer (IRB).

Page 88: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

88

Victim Cache

How do you deal with a full cache set? Use a victim cache, which holds evicted lines without losing their SM and SL bits.

It must be fast: every cache lookup needs to know "do I have the best replica of this line?" (on the critical path) and "do I cause a violation?" (not on the critical path).

Page 89: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

89

Summary of Hardware Support

Sub-epochs: violations hurt less! Shared-cache TLS support: faster communication and more room to store state. RAOs: don't speculate on known operations, reducing the amount of speculative state.

Page 90: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

90

Summary of Hardware Changes

Sub-epochs: checkpoint register state; needs replicas in the cache. Shared-cache TLS support: speculative L1, replication in L1 and L2, a speculative victim cache, and the Invalidation Required Buffer. RAOs: suspend/resume speculation, mutexes, and an "undo list".

Page 91: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

91

TLS Execution

[Figure: the *p/*q example on four CPUs with private L1s over a shared L2; invalidations for *p and *q propagate between the caches and trigger the violation.]


Page 95: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

95

Problems with Old Cache Design

Database epochs are large: the L1 cache is not large enough. Sub-epochs add more state: the L1 cache is not associative enough. Database epochs communicate: the L1 cache only communicates committed data.

Page 96: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

96

Intro Summary

TLS makes intra-transaction parallelism easy. Divide the transaction into epochs. Hardware support: detect violations, restart to recover, and buffer state; sub-epochs mitigate the restart penalty. New process: modify the software to avoid violations and improve performance.

Page 97: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

97

The Many Faces of Ogg

[Cartoon: Ogg the caveman: "Money!", "pwhile() {}", "while(too_slow) make_faster();", "Must tune reorder buffer…"]


Page 99: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

99

Removing Bottlenecks

Three general techniques:

Partition data structures (malloc)

Postpone operations until non-speculative (latches and locks, log entries)

Handle speculation manually (buffer pool)

Page 100: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

100

Bottlenecks Encountered

Buffer pool, latches & locks, malloc/free, cursor queues, error checks, false sharing, B-tree performance optimization, log entries


Page 102: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

102

Performance on 4 CPUs

Unmodified benchmark: Modified benchmark:

Page 103: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

103

Incremental Parallelization

[Graph: performance improvement over tuning time, on 4 CPUs]

Page 104: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

104

Scaling

Unmodified benchmark: Modified benchmark:

Page 105: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

105

Parallelization is Hard

[Graph: Performance Improvement vs. Programmer Effort, comparing hand parallelization, a parallelizing compiler, and "what we want": steady gains from incremental tuning steps.]


Page 109: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

109

Case Study: New Order (TPC-C)

Begin transaction {
  Read customer info
  Read & increment order #
  Create new order
  For each item in order {
    Get item info
    Decrement count in stock
    Record order info
  }
} End transaction

80% of transaction execution time
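One way to picture turning the per-item loop into epochs is the sketch below. It is illustrative only, with invented table contents: iterations may execute in any order (here, on a thread pool), but their effects are committed in loop order, which is the sequential semantics TLS preserves. It assumes the items in one order are distinct.

```python
# Toy sketch of epoch-per-iteration execution of the New Order item loop.
from concurrent.futures import ThreadPoolExecutor

def new_order_items(stock, order_line, items):
    # Each iteration is an "epoch": read item info, compute the new count.
    def epoch(i, item):
        return (i, item, stock[item] - 1)

    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda p: epoch(*p), enumerate(items)))

    # Effects commit in loop order, preserving sequential semantics.
    for i, item, qty in sorted(results):
        stock[item] = qty
        order_line.append((i, item))
```

Real TLS hardware does this commit ordering (and conflict detection) transparently; the point of the sketch is only the split between out-of-order execution and in-order commit.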



Page 112: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

Step 2: Changing the Software

Page 113: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

113

No problem!

Loop is easy to parallelize using TLS! Not really: calls into the DBMS invoke complex operations, so Ogg needs to do some work. Many operations in the DBMS are parallel, but they were not written with TLS in mind!

Page 114: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

114

Resource Management

Mutexes acquired and released

Locks locked and unlocked

Cursors pushed and popped from free stack

Memory allocated and freed

Buffer pool entries acquired and released

Page 115: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

115

Mutexes: Deadlock?

Problem: re-ordered acquire/release operations possibly introduced deadlock!

Solutions: avoidance (static acquire order) or recovery (detect deadlock and violate).
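The avoidance strategy (static acquire order) is easy to sketch. A minimal illustration, with invented lock names: no matter what order callers ask for mutexes, they are always acquired in one fixed global order, so two threads can never hold each other's next lock.

```python
# Minimal sketch of deadlock avoidance via a static acquire order.
import threading

locks = {name: threading.Lock() for name in ("district", "stock")}

def with_locks(names, fn):
    # Always acquire in a fixed global (alphabetical) order,
    # regardless of the order the caller listed them in.
    ordered = sorted(names)
    for n in ordered:
        locks[n].acquire()
    try:
        return fn()
    finally:
        for n in reversed(ordered):
            locks[n].release()
```

Two threads requesting ("stock", "district") and ("district", "stock") both acquire "district" first, so the classic lock-order deadlock cannot occur.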

Page 116: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

116

Locks

Like mutexes, but: they allow multiple readers, have no memory overhead when not held, and are often held for much longer. Treat them similarly to mutexes.

Page 117: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

117

Cursors

Used for traversing B-trees; pre-allocated and kept in pools

Page 118: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

118

Maintaining Cursor Pool

[Figure: cursors are taken from the head of a shared free list (Get), Used, then Released back to the head; when two epochs touch the shared head pointer, the later one suffers a violation.]


Page 122: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

122

Parallelizing Cursor Pool

Use per-CPU pools: modify the code so each CPU gets its own pool. No sharing == no violations! Requires a cpuid() instruction.

[Figure: two per-CPU free lists, each with its own head and Get/Use/Release cycle.]

Page 123: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

123

Memory Allocation

Problem: malloc() metadata causes dependences.

Solutions: per-CPU memory pools; parallelized free list.

Page 124: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

124

The Log

Append records to a global log; appending causes a dependence. The global log sequence number (LSN) can't be parallelized directly, so generate log records in buffers and assign LSNs when homefree.
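The deferred-LSN idea can be sketched as follows. This is an illustrative model with invented record contents: each epoch appends log records to a private buffer, and LSNs are assigned only when the epoch is homefree (oldest, no longer able to be violated), so the global LSN counter is never touched speculatively.

```python
# Sketch of deferred LSN assignment for speculative epochs.
class Log:
    def __init__(self):
        self.next_lsn = 0
        self.records = []

    def commit_epoch(self, buffered):
        # LSNs are assigned only here, once the epoch is homefree,
        # so epochs commit records in logical order.
        for rec in buffered:
            self.records.append((self.next_lsn, rec))
            self.next_lsn += 1
```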

Page 125: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

125

B-Trees

Leaf pages contain free-space counts. Inserts of random records: o.k. Inserting adjacent records: dependence on decrementing the count. Page splits: infrequent.

Page 126: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

126

Other Dependences

Statistics gathering Error checks False sharing

Page 127: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

127

Related Work

Lots of work in TLS: Multiscalar (Wisconsin) Hydra (Stanford) IACOMA (Illinois) RAW (MIT)

Hand parallelizing using TLS: Manohar Prabhu and Kunle Olukotun (PPoPP'03)

Page 128: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

Any questions?

Page 129: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

129

Why is this a problem?

B-tree insertion into the ORDERLINE table; the key is ol_n.

DBMS does not know that keys will be sequential

Each insert usually updates the same btree page

for(ol_n=0; ol_n<15; ol_n++) {
  INSERT INTO ORDERLINE (ol_n, ol_item, ol_cnt)
  VALUES (:ol_n, :ol_item, :ol_cnt);
}

Page 130: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

130

Sequential Btree Inserts

[Figure: successive states of a B-tree leaf page as adjacent records are inserted; each insert fills a slot and decrements the page's free-space count, so every insert updates the same count on the same page.]

Page 131: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

132

Outline

Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences

Aggressive update propagation (implicit forwarding)

Sub-epochs Results and analysis


Page 133: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

134

Tolerating dependences

Aggressive update propagation: get it for free!

Sub-epochs: periodically checkpoint epochs, every N instructions? Picking N may be interesting; perhaps checkpoints could be set before the location of previous violations?


Page 135: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

136

Why not faster?

Possible reasons: Idle cpus RAO mutexes Violations Cache effects Data dependences

Page 136: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

137

Why not faster?

Possible reasons: idle CPUs.

9 epochs/region on average: two bundles of four and one of one, so ¼ of CPU cycles are wasted!

Page 137: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

138

Why not faster?

Possible reasons: RAO mutexes. Not implemented yet. Oops!

Page 138: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

139

Why not faster?

Possible reasons: violations.

21/969 epochs violated; distance-1 dependences "magic synchronized"; 2.2 Mcycles lost (over 4 CPUs), about 1.5%.

Page 139: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

140

Why not faster?

Possible reasons: cache effects. Deserves its own slide.

Page 140: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

141

Cache effects of speculation

Only 20% of references are speculative! Speculative references have a small impact on the non-speculative hit rate (<1%). Speculative refs miss a lot in the L1: 9-15% for reads, 2-6% for writes. The L2 saw a HUGE increase in traffic: 152k refs to 3474k refs; speculative and non-speculative lines are thrashing between the L1s.

Page 141: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

142

Why not faster?

Possible reasons: data dependences. Oh yeah! The B-tree item count. Split up the B-tree insert into alloc and write? Do the alloc as an RAO; needs more thought.

Page 142: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

143

L2 Cache Line

[Figure: an L2 cache line holds, per set (Set 1, Set 2): valid, dirty, exclusive, LRU, tag, and data fields, plus SL and SM bits for each CPU (CPU1, CPU2) and fine-grained SM bits.]

Page 143: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

144

Why are you here?

Want faster database systems. Have funky new hardware: Thread Level Speculation (TLS).

How can we apply TLS to database systems?

Side question: is this a VLDB or an ASPLOS talk?

Page 144: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

145

How?

Divide the transaction into TLS-threads. Run the TLS-threads in parallel; maintain sequential semantics. Profit!

Page 145: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

146

Why parallelize transactions?

Decrease transaction latency. Increase concurrency while avoiding the concurrency-control bottleneck (a.k.a. use more CPUs for the same number of transactions).

The obvious: database performance matters.

Page 146: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

147

Shopping List

What do we need? (research scope)

Cheap hardware: Thread Level Speculation (TLS). Minor changes allowed.

Important database application: TPC-C. Almost no changes allowed!

Modular database system: BerkeleyDB. Some changes allowed.

Page 147: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

148

Outline

TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions


Page 149: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

150

What’s new?

Database operations are large and complex: large TLS-threads, lots of dependences, difficult to analyze.

Want: programmer optimization effort = faster program.

Page 150: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

151

Hardware changes summary

Must tolerate dependences: prediction? implicit forwarding? May need larger caches; may need larger associativity.


Page 152: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

153

Parallelization Strategy

1. Pick a benchmark
2. Parallelize a loop
3. Analyze dependences
4. Optimize away dependences
5. Evaluate performance
6. If not satisfied, goto 3

Page 153: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

154

Outline

TLS Hardware The Benchmark (TPC-C) Changing the database system

Resource management The log B-trees False sharing

Results Conclusions


Page 155: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

156

Results

Viola simulator: single CPI, perfect violation prediction, no memory system, 4 CPUs, exhaustive dependence tracking.

Currently working on an out-of-order superscalar simulation (cello).

10-transaction warm-up; measure 100 transactions.


Page 157: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

158

Conclusions

TLS can improve transaction latency. Violation predictors are important iff dependences must be tolerated. TLS makes hand parallelizing easier.

Page 158: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

159

Improving Database Performance

How to improve performance: parallelize the transaction, or increase the number of concurrent transactions.

Both of these require independence of database operations!


Page 162: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

163

Case Study: New Order (TPC-C)

Begin transaction {
  Read customer info (customer, warehouse)
  Read & increment order # (district)
  Create new order (orders, neworder)
  For each item in order {
    Get item info (item)
    Decrement count in stock (stock)
    Record order info (orderline)
  }
} End transaction

Parallelize this loop


Page 164: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

165

Implementing on a Real DB

Using BerkeleyDB. "Table" == "Database": give the database an arbitrary key and it will return arbitrary data (bytes); use structs for keys and rows. The database provides ACID through transactions, locking (page level), and storage management, and provides indexing using B-trees.

Page 165: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

166

Parallelizing a Transaction

For each item in order {
  Get item info (item)
  Decrement count in stock (stock)
  Record order info (order line)
}

Page 166: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

167

Parallelizing a Transaction

For each item in order {
  Get item info (item)
  Decrement count in stock (stock)
  Record order info (order line)
}

Each of these steps: get a cursor from the pool, use the cursor to traverse the B-tree, find the row and lock its page, release the cursor to the pool.


Page 171: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

172

Parallelizing Cursor Pool 1

Use per-CPU pools: modify the code so each CPU gets its own pool. No sharing == no violations! Requires a cpuid() instruction.

[Figure: two per-CPU free lists, each with its own head and Get/Use/Release cycle.]

Page 172: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

173

Parallelizing Cursor Pool 2

Dequeue and enqueue: atomic and unordered. Delay the enqueue until the end of the thread: forces separate pools, avoids modification of the data struct.

[Figure: Get dequeues from the shared head immediately; the Release enqueue is deferred until the epoch ends.]
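Design 2 (delayed enqueue) can be sketched as below. This is an illustrative model with invented names: Get dequeues immediately, but Release only queues the cursor in epoch-local storage; the real enqueue happens when the epoch ends and is no longer speculative, so speculative epochs never write the shared free list.

```python
# Sketch of cursor release deferred until the end of the epoch.
class Epoch:
    def __init__(self):
        self.deferred = []       # cursors waiting to be re-enqueued

class DelayedPool:
    def __init__(self, cursors):
        self.free = list(cursors)

    def get(self, epoch):
        return self.free.pop()             # dequeue happens immediately

    def release(self, epoch, cursor):
        epoch.deferred.append(cursor)      # enqueue is postponed

    def end_epoch(self, epoch):
        self.free.extend(epoch.deferred)   # performed once non-speculative
        epoch.deferred.clear()
```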

Page 173: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

174

Parallelizing Cursor Pool 3

Atomic unordered dequeue & enqueue

Cursor struct is “TLS unordered”

Struct defined as a byte range in memory

[Figure: multiple epochs Get, Use, and Release cursors through the shared head concurrently, without TLS ordering applied to the cursor structs.]

Page 174: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

175

Parallelizing Cursor Pool 4

Mutex-protect dequeue & enqueue; declare the pointer to the cursor struct to be "TLS unordered"

Any access through pointer does not have TLS applied

Pointer is tainted, any copies of it keep this property

Page 175: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

176

Problems with 3 & 4

What exactly is the boundary of a structure?

How do you express the concept of object in a loosely-typed language like C?

A byte range or a pointer is only an approximation.

Dynamically allocated sub-components?

Page 176: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

177

Mutexes in a TLS world

Two types of threads: “real” threads TLS threads

Two types of mutexes: Inter-real-thread Inter-TLS-thread

Page 177: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

178

Inter-real-thread Mutexes

Acquire == get mutex for all TLS threads

Release == release for the current TLS thread; it may still be held by another TLS thread!

Page 178: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

179

Inter-TLS-thread Mutexes

Should never interact between two real threads

Implies no TLS ordering between TLS threads while the mutex is held. But what to do on a violation? Can't just throw away changes to memory; must undo the operations performed in the critical section.

Page 179: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

180

Parallelizing Databases using TLS

Split transactions into threads. The threads created are large: 60k+ instructions, 16kB of speculative state, more dependences between threads.

How do we design a machine which can handle these large threads?

Page 180: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

181

The “Old” Way

[Figure: two chips of four CPUs; each CPU's private L1 cache buffers its epoch's speculative state, while the shared L2s, the L3, and the memory system hold only committed state.]

Page 181: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

182

The “Old” Way

Advantages: each epoch has its own L1 cache; epoch state does not intermix.

Disadvantages: the L1 cache is too small! Full cache == dead meat. No shared speculative memory.

Page 182: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

183

The “New” Way

The L2 cache is huge! State of the art in caches (Power5): 1.92MB 10-way L2, 32kB 4-way L1. Shared speculative memory "for free"; keeps the TLS logic off of the critical path.

Page 183: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

184

TLS Shared L2 Design

L1: write-through, write-no-allocate [CullerSinghGupta99]. Easy to understand and reason about; writes are visible to the L2, which simplifies shared speculative memory. L2 cache: shared-cache architecture with replication. Rest of memory: distributed TLS coherence.

Page 184: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

185

TLS Shared L2 Design

[Figure: the same two-chip system; the shared L2s now hold the "real" speculative state, while the L1s hold cached speculative state.]


Page 186: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

Part II: Dealing with dependences

Page 187: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

188

Predictor Design

How do you design a predictor that identifies violating loads, identifies the last store that causes them, only triggers when they cause a problem, and has very, very high accuracy?

???

Page 188: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

189

Sub-epoch design

Like checkpointing: leave "holes" in the epoch-number space and start a new epoch every 5k instructions. Uses more cache to buffer changes (more strain on associativity/victim cache) and uses more epoch contexts.
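The payoff of sub-epochs is how far a violation rolls back. A toy model, with the 5k-instruction interval taken from the slide and all function names invented: checkpoints are placed every N "instructions", and a violation restarts from the latest checkpoint at or before the violating operation instead of the start of the epoch.

```python
# Toy model of sub-epoch checkpointing and rollback distance.
def checkpoint_indices(num_ops, n=5000):
    # A checkpoint at op 0, then one every n instructions (a sub-epoch).
    return list(range(0, num_ops, n))

def rollback_point(checkpoints, violated_at):
    # Restart from the last checkpoint at or before the violating op.
    return max(c for c in checkpoints if c <= violated_at)
```

Without sub-epochs the rollback point is always 0; with them, a violation late in a 60k-instruction epoch discards at most one sub-epoch's worth of work.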

Page 189: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

190

Summary

Supporting large epochs needs: buffering state in the L2 instead of the L1, shared speculative memory, replication, a victim cache, and sub-epochs.

Page 190: Optimistic Intra-Transaction Parallelism using Thread Level Speculation

Any questions?