Optimistic Intra-Transaction Parallelism using Thread Level Speculation
Chris Colohan1, Anastassia Ailamaki1,J. Gregory Steffan2 and Todd C. Mowry1,3
1Carnegie Mellon University2University of Toronto
3Intel Research Pittsburgh
2
Chip Multiprocessors are Here!
2 cores now; soon 4, 8, 16, or 32
Multiple threads per core
How do we best use them?
IBM Power 5
AMD Opteron
Intel Yonah
3
Multi-Core Enhances Throughput
Database ServerUsers
Transactions DBMS Database
Cores can run concurrent transactions and improve
throughput
4
Using Multiple CoresDatabase ServerUsers
Transactions DBMS Database
Can multiple cores improve transaction latency?
5
Parallelizing transactions
SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}
DBMS
Intra-query parallelism
Used for long-running queries (decision support)
Does not work for short queries
Short queries dominate in commercial workloads
6
Parallelizing transactions
SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}
DBMS
Intra-transaction parallelism: each thread spans multiple queries
Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution…
7
Parallelizing transactions
SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}
DBMS
Intra-transaction parallelism: breaks transaction into threads
Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution…
Thread Level Speculation (TLS) makes parallelization easier.
8
Thread Level Speculation (TLS)
*p=
*q=
=*p
=*q
Sequential
Time
Parallel
*p=
*q=
=*p
=*q
=*p
=*q
Epoch 1 Epoch 2
9
Thread Level Speculation (TLS)
*p=
*q=
=*p
=*q
Sequential
Time
*p=
*q=
=*p
R2
Violation!
=*p
=*q
Parallel
Use epochs
Detect violations
Restart to recover
Buffer state
Oldest epoch: never restarts, no buffering
Worst case: sequential
Best case: fully parallel
Data dependences limit performance.
Epoch 1 Epoch 2
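The epoch mechanism above can be sketched as a toy software model (illustrative only, not the paper's hardware): each epoch records the addresses it speculatively loaded, and a store by an earlier epoch to one of those addresses violates the later epoch, forcing it to restart.

```c
#include <stddef.h>

#define MAX_LOADS 16

typedef struct {
    const void *loads[MAX_LOADS]; /* addresses speculatively loaded */
    int n_loads;
    int violated;
} epoch_t;

static void epoch_load(epoch_t *e, const void *addr) {
    if (e->n_loads < MAX_LOADS)
        e->loads[e->n_loads++] = addr;
}

/* Store by an epoch older than `later`: check the later epoch's load set. */
static void earlier_store(epoch_t *later, const void *addr) {
    for (int i = 0; i < later->n_loads; i++)
        if (later->loads[i] == addr)
            later->violated = 1; /* read-after-write dependence: restart */
}
```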
10
TransactionProgrammer
DBMS Programmer
Hardware Developer
A Coordinated Effort
Choose epoch boundaries
Remove performance bottlenecks
Add TLS support to architecture
11
So what’s new?
Intra-transaction parallelism:
Without changing the transactions
With minor changes to the DBMS
Without having to worry about locking
Without introducing concurrency bugs
With good performance
Halve transaction latency on four cores
12
Related Work
Optimistic Concurrency Control (Kung82)
Sagas (Molina&Salem87)
Transaction chopping (Shasha95)
13
Outline
Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions
14
Case Study: New Order (TPC-C)
Only dependence is on the quantity field
Very unlikely to occur (1/100,000)
GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}
78% of transaction execution time
15
Case Study: New Order (TPC-C)
GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}
GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
TLS_foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}
16
Outline
Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions
17
Dependences in DBMS
Time
18
Dependences in DBMS
Time
Dependences serialize execution!
Example: statistics gathering (pages_pinned++)
TLS maintains serial ordering of increments
To remove, use per-CPU counters
Performance tuning: profile execution, remove bottleneck dependence, repeat
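The per-CPU counter fix can be sketched as follows (names illustrative): each CPU increments its own slot, so speculative epochs on different CPUs never write the same location and TLS sees no dependence; the total is summed only when actually needed.

```c
#define NCPUS 4

static long pages_pinned[NCPUS]; /* one counter per CPU; pad to cache lines in practice */

static void inc_pages_pinned(int cpu) {
    pages_pinned[cpu]++; /* no cross-epoch dependence */
}

static long total_pages_pinned(void) {
    long sum = 0;
    for (int i = 0; i < NCPUS; i++)
        sum += pages_pinned[i];
    return sum;
}
```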
19
Buffer Pool Management
CPU
Buffer Pool
get_page(5)
ref: 1
put_page(5)
ref: 0
20
get_page(5)put_page(5)
Buffer Pool Management
CPU
Buffer Pool
get_page(5)
ref: 0
put_page(5)
Time
get_page(5)
put_page(5)
TLS ensures first epoch gets the page first. Who cares?
TLS maintains original load/store order
Sometimes this is not needed
get_page(5)put_page(5)
21
Buffer Pool Management
CPU
Buffer Pool
get_page(5)
ref: 0
put_page(5)
Time
get_page(5)
put_page(5)
= Escape Speculation
Isolated: undoing get_page will not affect other transactions
Undoable: have an operation (put_page) which returns the system to its initial state
• Escape speculation
• Invoke operation
• Store undo function
• Resume speculation
get_page(5)put_page(5)
put_page(5)get_page(5)
22
Buffer Pool Management
CPU
Buffer Pool
get_page(5)
ref: 0
put_page(5)
Time
get_page(5)
put_page(5)
get_page(5)put_page(5)
Not undoable!
get_page(5)put_page(5)
= Escape Speculation
23
Buffer Pool Management
CPU
Buffer Pool
get_page(5)
ref: 0
put_page(5)
Time
get_page(5)
put_page(5)
get_page(5)
put_page(5)
Delay put_page until end of epoch
Avoid dependence
= Escape Speculation
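The "delay until end of epoch" idea can be sketched like this (names and structure illustrative, not BerkeleyDB's): instead of calling put_page() inside the epoch, which would create a dependence on the reference count, the release is queued and replayed once the epoch commits.

```c
#define MAX_DEFERRED 32

typedef struct {
    int page_ids[MAX_DEFERRED];
    int n;
} release_queue_t;

static void enqueue_put_page(release_queue_t *q, int page_id) {
    if (q->n < MAX_DEFERRED)
        q->page_ids[q->n++] = page_id; /* record it; don't touch the refcount yet */
}

/* Called when the epoch is oldest ("homefree"): perform the real releases. */
static int commit_releases(release_queue_t *q, int *refcount) {
    int done = 0;
    for (int i = 0; i < q->n; i++) {
        (*refcount)--; /* the real put_page() call would go here */
        done++;
    }
    q->n = 0;
    return done;
}
```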
24
Removing Bottleneck Dependences
We introduce three techniques:
Delay operations until non-speculative: mutex and lock acquire and release; buffer pool, memory, and cursor release; log sequence number assignment
Escape speculation: buffer pool, memory, and cursor allocation
Traditional parallelization: memory allocation, cursor pool, error checks, false sharing
25
Outline
Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions
26
Experimental Setup
Detailed simulation: superscalar, out-of-order, 128-entry reorder buffer; memory hierarchy modeled in detail
TPC-C transactions on BerkeleyDB: in-core database, single user, single warehouse
Measure interval of 100 transactions; measuring latency, not throughput
[Diagram: four CPUs, each with a 32KB 4-way L1 cache, sharing a 2MB 4-way L2 cache and the rest of the memory system]
27
Optimizing the DBMS: New Order
[Bar chart: normalized time as optimizations are applied in sequence: Sequential, No Optimizations, Latches, Locks, Malloc/Free, Buffer Pool, Cursor Queue, Error Checks, False Sharing, B-Tree, Logging; each bar broken into Idle CPU, Violated, Cache Miss, and Busy time]
Cache misses increase
Other CPUs not helping
Can't optimize much more
26% improvement
28
Optimizing the DBMS: New Order
[Bar chart: same normalized-time breakdown across the optimization sequence]
This process took me 30 days and <1200 lines of code.
29
Other TPC-C Transactions
[Bar chart: normalized time for New Order, Delivery, Stock Level, Payment, and Order Status; each bar broken into Idle CPU, Failed, Cache Miss, and Busy time]
3 of 5 transactions speed up by 46-66%
30
Conclusions
TLS makes intra-transaction parallelism practical
Reasonable changes to transaction, DBMS, and hardware
Halve transaction latency
31
Needed backup slides (not done yet)
2 proc. results
Shared caches may change how you want to extract parallelism!
Just have lots of transactions: no sharing
TLS may have more sharing
Any questions?
For more information, see:www.colohan.com
Backup Slides Follow…
LATCHES
35
Latches
Mutual exclusion between transactions
Cause violations between epochs: read-test-write cycle (RAW)
Not needed between epochs: TLS already provides mutual exclusion!
36
Latches: Aggressive Acquire
Acquire; latch_cnt++ … work … latch_cnt--
Homefree
latch_cnt++ … work … (enqueue release)
Commit work; latch_cnt--
Homefree
latch_cnt++ … work … (enqueue release)
Commit work; latch_cnt--; Release
Large critical section
37
Latches: Lazy Acquire
Acquire … work … Release
Homefree
(enqueue acquire) … work … (enqueue release)
Acquire; Commit work; Release
Homefree
(enqueue acquire) … work … (enqueue release)
Acquire; Commit work; Release
Small critical sections
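The lazy-acquire scheme can be sketched as a small model (illustrative, not real latch code): while speculative, the epoch takes no latch at all, since TLS already orders conflicting accesses; it only logs the acquire and release, and at commit the log is replayed so the latch is held just for the commit work.

```c
enum latch_op { ACQUIRE, RELEASE };

typedef struct {
    enum latch_op ops[8];
    int n;
} latch_log_t;

static void lazy_acquire(latch_log_t *l) { l->ops[l->n++] = ACQUIRE; }
static void lazy_release(latch_log_t *l) { l->ops[l->n++] = RELEASE; }

/* Replay at commit; returns the maximum latch hold depth, which is all
 * the "critical section" that remains. */
static int replay_max_held(const latch_log_t *l) {
    int held = 0, max_held = 0;
    for (int i = 0; i < l->n; i++) {
        held += (l->ops[i] == ACQUIRE) ? 1 : -1;
        if (held > max_held) max_held = held;
    }
    return max_held;
}
```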
HARDWARE
39
TLS in Database Systems
Non-DatabaseTLS
Time
TLS in DatabaseSystems
Large epochs:
• More dependences → must tolerate
• More state → bigger buffers
Concurrenttransactions
40
Feedback Loop I know this is
parallel!
for() { do_work();}
par_for() { do_work();}
Must…Make…Faster
think
feed back
41
Violations == Feedback
*p=
*q=
=*p
=*q
Sequential
Time
*p=
*q=
=*p
R2
Violation!
=*p
=*q
Parallel
0x0FD8 0xFD20 0x0FC0 0xFC18
Must…Make…Faster
42
Eliminating Violations
*p=
*q=
=*p
R2
Violation!
=*p
=*q
Parallel
*q==*q
=*q
Violation!
Eliminate *p Dep.
Time
0x0FD8 0xFD20 0x0FC0 0xFC18
Optimization may make it slower?
43
Tolerating Violations: Sub-epochs
Time
*q=Violation!
Sub-epochs
=*q
=*q
*q==*q
=*q
Violation!
Eliminate *p Dep.
44
Sub-epochs
Started periodically by hardware: how many? when to start?
Hardware implementation: just like epochs; use more epoch contexts
No need to check violations between sub-epochs within an epoch
[Diagram: a violated epoch restarts from its most recent sub-epoch, not from the start]
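The payoff of sub-epochs is easy to quantify in a sketch: with a checkpoint every `interval` instructions, a violation at instruction `viol` rewinds only to the most recent checkpoint instead of to instruction 0, so just `viol % interval` instructions of work are lost.

```c
/* Nearest checkpoint at or before the violating instruction. */
static int restart_point(int viol, int interval) {
    return (viol / interval) * interval;
}

/* Instructions of work discarded by the rewind. */
static int work_lost(int viol, int interval) {
    return viol - restart_point(viol, interval);
}
```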
45
Old TLS Design
[Diagram: four CPUs, each with a private L1 cache, sharing an L2 cache and the rest of the memory system]
Buffer speculative state in write-back L1 cache
Detect violations through invalidations
Rest of system only sees committed data
Restart by invalidating speculative lines
Problems:
• L1 cache not large enough
• Later epochs only get values on commit
46
New Cache Design
[Diagram: four CPUs with private L1 caches over a shared L2 cache; invalidations exchanged with other L2 caches]
Buffer speculative and non-speculative state for all epochs in L2
Speculative writes immediately visible to L2 (and later epochs)
Detect violations at lookup time
Invalidation coherence between L2 caches
Restart by invalidating speculative lines
47
New Features
[Diagram: four CPUs with private L1 caches over a shared L2 cache]
Speculative state in L1 and L2 cache
Cache line replication (versions)
Data dependence tracking within cache
Speculative victim cache
48
Scaling
[Two bar charts: normalized time for Sequential, 2, 4, and 8 CPUs; left: original benchmark, right: modified with 50-150 items/transaction; bars broken into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, and Instruction Execution]
49
Evaluating a 4-CPU system
[Bar chart: normalized time for Sequential (original benchmark run on 1 CPU), TLS Seq (parallelized benchmark run on 1 CPU), No Sub-epoch (without sub-epoch support), Baseline (parallel execution), and No Speculation (ignore violations: Amdahl's Law limit); bars broken into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, and Instruction Execution]
50
Sub-epochs: How many/How big?
[Chart: normalized time vs. number of sub-epochs / instructions per sub-epoch]
• Supporting more sub-epochs is better
• Spacing depends on location of violations
• Even spacing is good enough
51
Query Execution
Actions taken by a query:
Bring pages into buffer pool
Acquire and release latches & locks
Allocate/free memory
Allocate/free and use cursors
Use B-trees
Generate log entries
These generate violations.
52
Applying TLS
1. Parallelize loop
2. Run benchmark
3. Remove bottleneck
4. Go to 2
53
Outline
Hardware Developer
TransactionProgrammer
DBMS Programmer
54
Violation Prediction
*q=
=*q
Done
Predict Dependences
Predictor
*q==*q
=*q
Violation!
Eliminate R1/W2
Time
55
Violation Prediction
*q=
=*q
Done
Predict Dependences
Predictor
Time
Predictor problems:
Large epochs → many predictions
Failed prediction → violation
Incorrect prediction → large stall
Two predictors required:
Last store Dependent load
56
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Diagram: four CPUs, each with a private L1 cache, sharing an L2 cache and the rest of the memory system]
57
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Table: L2 cache line state for *p, *s, *t: valid bits plus per-CPU SM (speculatively modified) and SL (speculatively loaded) bits for CPUs 1-4]
58
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Table: L2 cache line state with per-CPU SM/SL bits]
59
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Table: L2 cache line state with per-CPU SM/SL bits]
60
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Table: L2 cache line state for *p and *q with per-CPU SM/SL bits]
61
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Table: L2 cache line state for *p and *q with per-CPU SM/SL bits]
62
Replication
*p=
*q=
=*p
R2
Violation!
=*p
=*q
*q=
Can't invalidate a line if it contains two epochs' changes
[Table: replicated cache line state with per-CPU SM/SL bits]
63
Replication
*p=
*q=
=*p
R2
Violation!
=*p
=*q
*q=
[Table: replicated cache line state with per-CPU SM/SL bits]
64
Replication
*p=
*q=
=*p
R2
Violation!
=*p
=*q
*q=
Makes epochs independent
Enables sub-epochs
[Table: replicated cache line state with per-CPU SM/SL bits]
65
Sub-epochs
*p= *q= =*p =*q
[Table: sub-epoch contexts 1a-1d with per-context SM/SL bits for *p and *q]
Uses more epoch contexts
Detection/buffering/rewind is "free"
More replication: speculative victim cache
66
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
  static tls_mutex mut;
  page_t *ret;

  tls_escape_speculation();
  check_get_arguments(id);
  tls_acquire_mutex(&mut);
  ret = get_page(id);
  tls_release_mutex(&mut);
  tls_on_violation(put, ret);
  tls_resume_speculation();
  return ret;
}
Wraps get_page()
67
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret;
tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation()
return ret;}
No violations while calling get_page()
68
May get bad input data from speculative thread!
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret;
tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation()
return ret;}
69
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret;
tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation()
return ret;}
Only one epoch per transaction at a time
70
How to undo get_page()
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret;
tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation()
return ret;}
71
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret;
tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation()
return ret;}
Isolated: undoing this operation does not cause cascading aborts
Undoable: easy way to return system to initial state
Can also be used for: cursor management, malloc()
72
TPC-C Benchmark
Company
Warehouse 1 Warehouse W
District 1 District 2 District 10
Cust 1 Cust 2 Cust 3k
73
TPC-C Benchmark
[Schema diagram: Warehouse (W) → District (W*10) → Customer (W*30k) → Order (W*30k+) → Order Line (W*300k+), plus New Order (W*9k+), History (W*30k+), Stock (W*100k), and Item (100k), with relationship cardinalities 10, 3k, 1+, 1+, 0-1, 5-15, 3+W, 100k]
74
What is TLS?
while(cond) {
x = hash[i];
...
hash[j] = y;
...
}
Time
= hash[3];
...
hash[10] =
...
= hash[19];
...
hash[21] =
...
= hash[33];
...
hash[30] =
...
= hash[10];
...
hash[25] =
...
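The dependence in the hash example above can be stated as a predicate (illustrative): a violation occurs exactly when a later epoch has already read an element that an earlier epoch then writes, as with hash[10] in the slides.

```c
/* Does an earlier epoch's write to hash[earlier_write_idx] violate a
 * later epoch that read hash[later_read_idx]? */
static int causes_violation(int earlier_write_idx, int later_read_idx) {
    return earlier_write_idx == later_read_idx;
}
```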
75
What is TLS?
while(cond) {
x = hash[i];
...
hash[j] = y;
...
}
Time
= hash[3];
...
hash[10] =
...
= hash[19];
...
hash[21] =
...
= hash[33];
...
hash[30] =
...
= hash[10];
...
hash[25] =
...
Processor A Processor B Processor C Processor D
Thread 1 Thread 2 Thread 3 Thread 4
76
What is TLS?
while(cond) {
x = hash[i];
...
hash[j] = y;
...
}
Time
= hash[3];
...
hash[10] =
...
= hash[19];
...
hash[21] =
...
= hash[33];
...
hash[30] =
...
= hash[10];
...
hash[25] =
...
Processor A Processor B Processor C Processor D
Thread 1 Thread 2 Thread 3 Thread 4
Violation!
77
What is TLS?
while(cond) {
x = hash[i];
...
hash[j] = y;
...
}
Time
= hash[3];
...
hash[10] =
...attempt_commit()
= hash[19];
...
hash[21] =
...attempt_commit()
= hash[33];
...
hash[30] =
...attempt_commit()
Processor A Processor B Processor C Processor D
Thread 1 Thread 2 Thread 3 Thread 4
Violation!
= hash[10];
...
hash[25] =
...attempt_commit()
78
What is TLS?
while(cond) {
x = hash[i];
...
hash[j] = y;
...
}
Time
= hash[3];
...
hash[10] =
...attempt_commit()
= hash[19];
...
hash[21] =
...attempt_commit()
= hash[33];
...
hash[30] =
...attempt_commit()
= hash[10];
...
hash[25] =
...attempt_commit()
Processor A Processor B Processor C Processor D
Thread 1 Thread 2 Thread 3 Thread 4
Violation!
Redo
= hash[10];
...
hash[25] =
...attempt_commit()
Thread 4
79
TLS Hardware Design
What's new? Large threads; epochs will communicate; complex control flow; huge legacy code base
How does hardware change? Store state in L2 instead of L1; reversible atomic operations; tolerate dependences
Aggressive update propagation (implicit forwarding); sub-epochs
80
L1 Cache Line
SL bit: "L2 cache knows this line has been speculatively loaded"; on violation or commit: clear
SM bit: "This line contains speculative changes"; on commit: clear; on violation: SM → Invalid
Otherwise, just like a normal cache
[Cache line fields: SL | SM | Valid | LRU | Tag | Data]
81
Escaping Speculation
Speculative epoch wants to make a visible change to the system!
Ignore SM lines while escaped
Stale bit: "This line may be outdated by speculative work"; on violation or commit: clear
[Cache line fields: SL | SM | Valid | LRU | Tag | Data | Stale]
82
L1 to L2 communication
L2 sees all stores (write-through)
L2 sees first load of an epoch: NotifySL message
L2 can track data dependences!
83
L1 Changes Summary
Add three bits to each line: SL, SM, Stale
Modify tag match to recognize bits
Add queue of NotifySL requests
[Cache line fields: SL | SM | Valid | LRU | Tag | Data | Stale]
84
L2 Cache Line
[Cache line fields: per-CPU SL/SM bits (CPU1, CPU2), Fine-Grained SM, Valid, Dirty, Exclusive, LRU, Tag, Data]
Cache line can be: Modified by one CPU Loaded by multiple CPUs
85
Cache Line Conflicts
Three classes of conflict:
Epoch 2 stores, epoch 1 loads: need "old" version to load
Epoch 1 stores, epoch 2 stores: need to keep changes separate
Epoch 1 loads, epoch 2 stores: need to be able to discard line on violation
Need a way of storing multiple conflicting versions in the cache
86
Cache line replication
On conflict, replicate line: split line into two copies; divide SM and SL bits at split point; divide directory bits at split point
87
Replication Problems
Complicates line lookup: need to find all replicas and select "best" (best == most recent replica)
Change management: on write, update all later copies; also need to find all more speculative replicas to check for violations
On commit must get rid of stale lines: Invalidation Required Buffer (IRB)
88
Victim Cache
How do you deal with a full cache set? Use a victim cache
Holds evicted lines without losing SM & SL bits
Must be fast; every cache lookup needs to know:
Do I have the "best" replica of this line? (critical path)
Do I cause a violation? (not on critical path)
89
Summary of Hardware Support
Sub-epochs: violations hurt less!
Shared cache TLS support: faster communication; more room to store state
RAOs: don't speculate on known operations; reduces amount of speculative state
90
Summary of Hardware Changes
Sub-epochs: checkpoint register state; needs replicas in cache
Shared cache TLS support: speculative L1; replication in L1 and L2; speculative victim cache; Invalidation Required Buffer
RAOs: suspend/resume speculation; mutexes; "undo list"
91
TLS Execution
*p=
*q=
=*p
R2
Violation!
=*p
=*q
[Diagram: four CPUs with private L1 caches over a shared L2 cache; speculative versions of *p and *q propagate, with an invalidation triggering the violation]
92
Problems with Old Cache Design
Database epochs are large: L1 cache not large enough
Sub-epochs add more state: L1 cache not associative enough
Database epochs communicate: L1 cache only communicates committed data
96
Intro Summary
TLS makes intra-transaction parallelism easy
Divide transaction into epochs
Hardware support: detect violations; restart to recover (sub-epochs mitigate penalty); buffer state
New process: modify software to avoid violations and improve performance
97
[Cartoon: a crowd chanting "Money!" while running while(too_slow) make_faster();]
The Many Faces of Ogg
Must tune reorder buffer…
98
Must tune reorder buffer…
The Many Faces of Ogg
[Cartoon: most of the crowd mutters "Duh…" while running while(too_slow) make_faster();]
99
Removing Bottlenecks
Three general techniques:
Partition data structures: malloc
Postpone operations until non-speculative: latches and locks, log entries
Handle speculation manually: buffer pool
100
Bottlenecks Encountered
Buffer pool; latches & locks; malloc/free; cursor queues; error checks; false sharing; B-tree performance optimization; log entries
101
Must tune reorder buffer…
The Many Faces of Ogg
[Cartoon: most of the crowd mutters "Duh…" while running while(too_slow) make_faster();]
102
Performance on 4 CPUs
[Charts: unmodified benchmark vs. modified benchmark]
103
Incremental Parallelization
Time
4 CPUs
104
Scaling
[Charts: unmodified benchmark vs. modified benchmark]
105
Parallelization is Hard
[Chart: Performance Improvement vs. Programmer Effort; curves for Hand Parallelization, Parallelizing Compiler, and "What we want", with repeated Tuning steps]
106
Case Study: New Order (TPC-C)
Begin transaction {
} End transaction
107
Case Study: New Order (TPC-C)
Begin transaction { Read customer info Read & increment order # Create new order
} End transaction
108
Case Study: New Order (TPC-C)
Begin transaction {
  Read customer info
  Read & increment order #
  Create new order
  For each item in order {
    Get item info
    Decrement count in stock
    Record order info
  }
} End transaction
109
Case Study: New Order (TPC-C)
Begin transaction { Read customer info Read & increment order # Create new order For each item in order { Get item info Decrement count in stock Record order info }} End transaction
80% of transaction execution time
110
Case Study: New Order (TPC-C)
Begin transaction { Read customer info Read & increment order # Create new order For each item in order { Get item info Decrement count in stock Record order info }} End transaction
80% of transaction execution time
111
The Many Faces of Ogg
[Cartoon: most of the crowd mutters "Duh…" while running while(too_slow) make_faster();]
Must tune reorder buffer…
Step 2: Changing the Software
113
No problem!
Loop is easy to parallelize using TLS! Not really:
Calls into DBMS invoke complex operations
Ogg needs to do some work
114
Resource Management
Mutexes acquired and released
Locks locked and unlocked
Cursors pushed and popped from free stack
Memory allocated and freed
Buffer pool entries Acquired and released
115
Mutexes: Deadlock?
Problem: re-ordered acquire/release operations may have introduced deadlock
Solutions:
Avoidance: static acquire order
Recovery: detect deadlock and violate
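The static-acquire-order avoidance rule can be sketched as follows (illustrative; the real system might order by lock id rather than address): if every thread acquires mutexes in one global order, no cycle of waiters can form, so the re-ordered acquires cannot deadlock.

```c
#include <stdint.h>

/* Are two locks in the globally agreed acquisition order (by address)? */
static int in_acquire_order(const void *first, const void *second) {
    return (uintptr_t)first < (uintptr_t)second;
}

/* Swap a pair of locks into the safe global acquisition order. */
static void order_pair(const void **a, const void **b) {
    if (!in_acquire_order(*a, *b)) {
        const void *t = *a;
        *a = *b;
        *b = t;
    }
}
```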
116
Locks
Like mutexes, but: allows multiple readers; no memory overhead when not held; often held for much longer
Treat similarly to mutexes
117
Cursors
Used for traversing B-trees Pre-allocated, kept in pools
118
Maintaining Cursor Pool
[Diagram: cursors cycle through Get/Use/Release via a shared pool head]
121
Maintaining Cursor Pool
[Diagram: two CPUs Get/Use/Release cursors through one shared pool head → Violation!]
122
Parallelizing Cursor Pool
Use per-CPU pools
Modify code: each CPU gets its own pool
No sharing == no violations!
Requires cpuid() instruction
[Diagram: two per-CPU cursor pools, each with its own head and Get/Use/Release cycle]
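The per-CPU cursor pool can be sketched like this (illustrative: cursors are just ints, and a cpuid()-style index is assumed to be passed in): each CPU pushes and pops its own free stack, so epochs on different CPUs never touch the same pool head and generate no violations.

```c
#define POOL_NCPUS 4
#define POOL_SZ 8

typedef struct {
    int items[POOL_SZ];
    int top;
} cursor_pool_t;

static cursor_pool_t pools[POOL_NCPUS]; /* one pool head per CPU */

static void pool_put(int cpu, int cursor) {
    cursor_pool_t *p = &pools[cpu];
    if (p->top < POOL_SZ)
        p->items[p->top++] = cursor;
}

static int pool_get(int cpu) {
    cursor_pool_t *p = &pools[cpu];
    return p->top > 0 ? p->items[--p->top] : -1; /* -1: pool empty */
}
```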
123
Memory Allocation
Problem: malloc() metadata causes dependences
Solutions: per-CPU memory pools; parallelized free list
124
The Log
Append records to global log; appending causes dependence
Can't parallelize: global log sequence number (LSN)
Generate log records in buffers; assign LSNs when homefree
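The deferred-LSN idea can be sketched as follows (illustrative structure, not the real log format): a speculative epoch appends log records to a private buffer with no LSN; once the epoch is homefree, it stamps them, in order, from the global sequence number.

```c
#define MAX_RECS 16

typedef struct {
    int lsn;      /* -1 until the epoch is homefree */
    char payload;
} log_rec_t;

typedef struct {
    log_rec_t recs[MAX_RECS];
    int n;
} log_buf_t;

static void log_append(log_buf_t *b, char payload) {
    if (b->n < MAX_RECS) {
        b->recs[b->n].lsn = -1; /* unknown until this epoch commits */
        b->recs[b->n].payload = payload;
        b->n++;
    }
}

/* Called when the epoch is oldest: consume LSNs from the global counter. */
static void assign_lsns(log_buf_t *b, int *global_lsn) {
    for (int i = 0; i < b->n; i++)
        b->recs[i].lsn = (*global_lsn)++;
}
```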
125
B-Trees
Leaf pages contain free-space counts
Inserts of random records: o.k.
Inserting adjacent records: dependence on decrementing count
Page splits: infrequent
126
Other Dependences
Statistics gathering Error checks False sharing
127
Related Work
Lots of work in TLS: Multiscalar (Wisconsin), Hydra (Stanford), IACOMA (Illinois), RAW (MIT)
Hand parallelizing using TLS: Manohar Prabhu and Kunle Olukotun (PPoPP'03)
Any questions?
129
Why is this a problem?
B-tree insertion into ORDERLINE table; key is ol_n
DBMS does not know that keys will be sequential
Each insert usually updates the same B-tree page
for(ol_n=0; ol_n<15; ol_n++) {
  INSERT into ORDERLINE (ol_n, ol_item, ol_cnt)
    VALUES (:ol_n, :ol_item, :ol_cnt);
}
130
Sequential Btree Inserts
[Diagram: B-tree leaf pages during sequential inserts; each page's free-space count (4, 3, 2, 1…) decrements as items fill the free slots]
132
Outline
Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences
Aggressive update propagation (implicit forwarding)
Sub-epochs Results and analysis
133
Outline
Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences
Aggressive update propagation (implicit forwarding)
Sub-epochs Results and analysis
134
Tolerating dependences
Aggressive update propagation: get for free!
Sub-epochs: periodically checkpoint epochs; every N instructions?
Picking N may be interesting; perhaps checkpoints could be set before the location of previous violations?
135
Outline
Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences
Aggressive update propagation (implicit forwarding)
Sub-epochs Results and analysis
136
Why not faster?
Possible reasons: idle CPUs, RAO mutexes, violations, cache effects, data dependences
137
Why not faster?
Possible reasons:
Idle CPUs: 9 epochs/region average; two bundles of four and one of one; ¼ of CPU cycles wasted!
RAO mutexes, violations, cache effects, data dependences
138
Why not faster?
Possible reasons:
Idle CPUs
RAO mutexes: not implemented yet. Oops!
Violations, cache effects, data dependences
139
Why not faster?
Possible reasons:
Idle CPUs, RAO mutexes
Violations: 21/969 epochs violated; distance 1 "magic synchronized"; 2.2 Mcycles (over 4 CPUs), about 1.5%
Cache effects, data dependences
140
Why not faster?
Possible reasons:
Idle CPUs, RAO mutexes, violations
Cache effects: deserves its own slide
Data dependences
141
Cache effects of speculation
Only 20% of references are speculative!
Speculative references have small impact on non-speculative hit rate (<1%)
Speculative refs miss a lot in L1: 9-15% for reads, 2-6% for writes
L2 saw HUGE increase in traffic: 152k refs to 3474k refs
Spec/non-spec lines are thrashing from L1s
142
Why not faster?
Possible reasons:
Idle CPUs, RAO mutexes, violations, cache effects
Data dependences: oh yeah! B-tree item count
Split up B-tree insert? alloc and write; do alloc as RAO; needs more thought
143
L2 Cache Line
SL
SM
Fin
e G
rain
ed
SM
Valid
Dir
ty
Exclu
siv
e
LR
U
Tag
Data
SL
SM
CPU1 CPU2
Set
1S
et
2
144
Why are you here?
Want faster database systems
Have funky new hardware: Thread Level Speculation (TLS)
How can we apply TLS to database systems?
Side question: is this a VLDB or an ASPLOS talk?
145
How?
Divide transaction into TLS-threads
Run TLS-threads in parallel; maintain sequential semantics
Profit!
146
Why parallelize transactions?
Decrease transaction latency
Increase concurrency while avoiding the concurrency-control bottleneck (a.k.a. use more CPUs, same number of transactions)
The obvious: Database performance matters
147
Shopping List
What do we need? (research scope)
Cheap hardware: Thread Level Speculation (TLS). Minor changes allowed.
Important database application: TPC-C. Almost no changes allowed!
Modular database system: BerkeleyDB. Some changes allowed.
148
Outline
TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
149
Outline
TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
150
What’s new?
Database operations are large and complex
Large TLS-threads: lots of dependences, difficult to analyze
Want: programmer optimization effort = faster program
151
Hardware changes summary
Must tolerate dependences: prediction? implicit forwarding?
May need larger caches
May need larger associativity
152
Outline
TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
153
Parallelization Strategy
1. Pick a benchmark
2. Parallelize a loop
3. Analyze dependences
4. Optimize away dependences
5. Evaluate performance
6. If not satisfied, goto 3
154
Outline
TLS Hardware The Benchmark (TPC-C) Changing the database system
Resource management The log B-trees False sharing
Results Conclusions
155
Outline
TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
156
Results
Viola simulator: single CPI, perfect violation prediction, no memory system, 4 CPUs, exhaustive dependence tracking
Currently working on an out-of-order superscalar simulation (cello)
10 transaction warm-up; measure 100 transactions
157
Outline
TLS Hardware
The Benchmark (TPC-C)
Changing the database system
Results
Conclusions
158
Conclusions
TLS can improve transaction latency
Violation predictors are important (iff dependences must be tolerated)
TLS makes hand-parallelizing easier
159
Improving Database Performance
How to improve performance:
  Parallelize the transaction
  Increase the number of concurrent transactions
Both of these require independence of database operations!
160
Case Study: New Order (TPC-C)
Begin transaction {
} End transaction
161
Case Study: New Order (TPC-C)
Begin transaction {
  Read customer info (customer, warehouse)
  Read & increment order # (district)
  Create new order (orders, neworder)
} End transaction
162
Case Study: New Order (TPC-C)
Begin transaction {
  Read customer info (customer, warehouse)
  Read & increment order # (district)
  Create new order (orders, neworder)
  For each item in order {
    Get item info (item)
    Decrement count in stock (stock)
    Record order info (orderline)
  }
} End transaction
163
Case Study: New Order (TPC-C)
Begin transaction {
  Read customer info (customer, warehouse)
  Read & increment order # (district)
  Create new order (orders, neworder)
  For each item in order {
    Get item info (item)
    Decrement count in stock (stock)
    Record order info (orderline)
  }
} End transaction
Parallelizethis loop
164
Implementing on a Real DB
Using BerkeleyDB
  "Table" == "Database"
  Give the database an arbitrary key; it returns arbitrary data (bytes)
  Use structs for keys and rows
Database provides ACID through:
  Transactions
  Locking (page level)
  Storage management
Provides indexing using B-trees
166
Parallelizing a Transaction
For each item in order {
  Get item info (item)
  Decrement count in stock (stock)
  Record order info (order line)
}
167
Parallelizing a Transaction
For each item in order {
  Get item info (item)
  Decrement count in stock (stock)
  Record order info (order line)
}
Each table access:
•Get cursor from pool
•Use cursor to traverse B-tree
•Find row, lock page for row
•Release cursor to pool
168
Maintaining Cursor Pool
[Diagram: the cursor pool is a linked list; head points to the free cursors. Get pops a cursor off head, Use it, Release pushes it back.]
169
Maintaining Cursor Pool
[Diagram: two epochs each Get, Use, and Release a cursor from the same pool; both update head: Violation!]
172
Parallelizing Cursor Pool 1
Use per-CPU pools: modify the code so each CPU gets its own pool
No sharing == no violations!
Requires a cpuid() instruction
[Diagram: two independent pools, one per CPU, each with its own head]
173
Parallelizing Cursor Pool 2
Dequeue and enqueue: atomic and unordered
Delay enqueue until end of thread
  Forces separate pools
  Avoids modification of the data struct
[Diagram: each epoch dequeues from the shared pool but holds its releases until it ends]
174
Parallelizing Cursor Pool 3
Atomic unordered dequeue & enqueue
Cursor struct is “TLS unordered”
Struct defined as a byte range in memory
[Diagram: multiple epochs Get/Use/Release cursors concurrently through the unordered pool]
175
Parallelizing Cursor Pool 4
Mutex-protect dequeue & enqueue; declare the pointer to the cursor struct to be "TLS unordered"
Any access through the pointer does not have TLS applied
The pointer is tainted; any copies of it keep this property
176
Problems with 3 & 4
What exactly is the boundary of a structure?
How do you express the concept of an object in a loosely-typed language like C?
A byte range or a pointer is only an approximation.
What about dynamically allocated sub-components?
177
Mutexes in a TLS world
Two types of threads: "real" threads, TLS threads
Two types of mutexes: inter-real-thread, inter-TLS-thread
178
Inter-real-thread Mutexes
Acquire == get the mutex for all TLS threads
Release == release for the current TLS thread
  May still be held by another TLS thread!
179
Inter-TLS-thread Mutexes
Should never interact between two real threads
Implies no TLS ordering between TLS threads while the mutex is held
But what to do on a violation?
  Can't just throw away changes to memory
  Must undo operations performed in the critical section
180
Parallelizing Databases using TLS
Split transactions into threads
Threads created are large:
  60k+ instructions
  16 kB of speculative state
  More dependences between threads
How do we design a machine that can handle these large threads?
181
The “Old” Way
[Diagram: two chip multiprocessors, each with four CPUs and private L1 caches holding per-epoch speculative state; each chip's L2 holds committed state, backed by an L3 and the memory system]
182
The “Old” Way
Advantages:
  Each epoch has its own L1 cache
  Epoch state does not intermix
Disadvantages:
  L1 cache is too small!
  Full cache == dead meat
  No shared speculative memory
183
The “New” Way
L2 cache is huge!
State of the art in caches (Power5):
  1.92 MB 10-way L2
  32 kB 4-way L1
Shared speculative memory "for free"
Keeps TLS logic off of the critical path
184
TLS Shared L2 Design
L1: write-through, write-no-allocate [CullerSinghGupta99]
  Easy to understand and reason about
  Writes visible to L2, which simplifies shared speculative memory
L2 cache: shared cache architecture with replication
Rest of memory: distributed TLS coherence
185
TLS Shared L2 Design
[Diagram: two chip multiprocessors, each with four CPUs; the private L1 caches hold cached speculative state, while each chip's shared L2 holds the "real" speculative state, backed by an L3 and the memory system]
Explain from the top down
186
TLS Shared L2 Design
Part II: Dealing with dependences
188
Predictor Design
How do you design a predictor that:
  Identifies violating loads
  Identifies the last store that causes them
  Only triggers when they cause a problem
  Has very, very high accuracy
???
189
Sub-epoch design
Like checkpointing
Leave "holes" in epoch # space
Every 5k instructions, start a new epoch
Uses more cache to buffer changes
  More strain on associativity / victim cache
Uses more epoch contexts
190
Summary
Supporting large epochs needs:
  Buffer state in L2 instead of L1
  Shared speculative memory
  Replication
  Victim cache
  Sub-epochs
Any questions?