Asymmetric Memory Fences: Optimizing Performance & Programmability
Yuelu Duan, Nima Honarmand, Josep Torrellas Department of Computer Science
University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu
ASPLOS, March 2015
Duan, Honarmand,Torrellas Asymmetric Memory Fences
Fence: a (Slow) Primitive for Parallelism
• Instruction inserted by programmers or compilers
• Prevents the compiler and HW from reordering memory accesses

wr y
fence
rd x
rd z

The post-fence accesses (rd x, rd z) cannot be observed by another processor until the pre-fence accesses have finished:
• reads retired
• writes retired + drained from the write buffer

Expensive: a fence on a Xeon-based desktop costs 20 to 200 cycles
There are HW proposals to eliminate the fence stall, such as WeeFence [ISCA13], but they need complex HW
Contributions
• Asymmetric Fence Groups: eliminate fence stall with simple HW
  – Weak Fence (wF): aggressive access reordering
  – Strong Fence (sF): conventional
• Best fence performance-cost trade-off yet
• Taxonomy of Asymmetric Fence Groups under TSO
Past Work: WeeFence
• Aggressive access reordering to eliminate pipeline stall
• Post-fence reads retire before the pre-fence writes have drained
  – “Skip” the fence
Substantial gains when write misses pile up before the fence
[ISCA 2013]
[Figure: timeline of writes (w1, w2), a fence (f), and a read (r) in the ROB and the write buffer (WB); the read executes speculatively.]
But… Reordering Can Cause Incorrect Execution
Initially x = y = 0

PA                PB
A0: x = 1         B0: y = 1
fence             fence
A1: t0 = y        B1: t1 = x

With fences: t0 = 1 or t1 = 1 or both = 1

With WeeFences (PA: wr x, WeeFence, rd y; PB: wr y, WeeFence, rd x):
the order A1 → B0 → B1 → A0 gives t0 = t1 = 0, a Sequential Consistency (SC) violation

Solution: WeeFence stalls rds/wrs if reordering may cause a cycle
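The litmus test above can be checked mechanically. The following is a minimal sketch, assuming a simplified TSO model with one FIFO store buffer per processor and conventional fences that wait for the buffer to drain. It does not model WeeFence itself; the unfenced run only shows the outcome that unchecked reordering would expose. The function names (`tso_outcomes`, `dekker`) and the operation encoding are illustrative, not from the slides.

```python
def tso_outcomes(programs, init_mem):
    """Return every reachable (t0, t1, ...) tuple under a simple TSO model.

    Ops: ("st", var, value), ("ld", reg, var), ("fence",).
    """
    results, seen = set(), set()

    def explore(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        finished = True
        for p, prog in enumerate(programs):
            if bufs[p]:
                # Drain the oldest buffered store of processor p to memory.
                finished = False
                var, val = bufs[p][0]
                new_mem = tuple(sorted({**dict(mem), var: val}.items()))
                new_bufs = bufs[:p] + (bufs[p][1:],) + bufs[p + 1:]
                explore(pcs, new_bufs, new_mem, regs)
            if pcs[p] == len(prog):
                continue
            finished = False
            op = prog[pcs[p]]
            new_pcs = pcs[:p] + (pcs[p] + 1,) + pcs[p + 1:]
            if op[0] == "st":
                # Stores retire into the local store buffer.
                new_bufs = bufs[:p] + (bufs[p] + ((op[1], op[2]),),) + bufs[p + 1:]
                explore(new_pcs, new_bufs, mem, regs)
            elif op[0] == "ld":
                # Loads forward from the newest matching buffered store, else memory.
                val = dict(mem)[op[2]]
                for var, buffered in bufs[p]:
                    if var == op[2]:
                        val = buffered
                new_regs = tuple(sorted({**dict(regs), op[1]: val}.items()))
                explore(new_pcs, bufs, mem, new_regs)
            elif op[0] == "fence" and not bufs[p]:
                # A conventional fence completes only once the buffer is empty.
                explore(new_pcs, bufs, mem, regs)
        if finished:
            results.add(regs)

    n = len(programs)
    explore((0,) * n, ((),) * n, tuple(sorted(init_mem.items())), ())
    return {tuple(v for _, v in regs) for regs in results}


def dekker(with_fence):
    """Build the PA/PB programs of the slide, with or without fences."""
    f = (("fence",),) if with_fence else ()
    pa = (("st", "x", 1),) + f + (("ld", "t0", "y"),)
    pb = (("st", "y", 1),) + f + (("ld", "t1", "x"),)
    return (pa, pb)


no_fence = tso_outcomes(dekker(False), {"x": 0, "y": 0})
fenced = tso_outcomes(dekker(True), {"x": 0, "y": 0})
print(sorted(no_fence))  # includes (0, 0)
print(sorted(fenced))    # (0, 0) is absent
```

The exhaustive search is practical here only because the litmus test is tiny; it is meant as a sanity check of the slide's claim, not as a verification tool.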
How WeeFence Works
PS: Pending Set
BS: Bypass Set

PA            PB
wr x          wr y
WeeFence1     WeeFence2
rd y          rd x

Fence group: WeeFence1 and WeeFence2
How WeeFence Works
PS: Pending Set   BS: Bypass Set

PA
wr x
WeeFence
rd y

[Figure: steps (1) to (6) of a WeeFence on PA, using a Table: the Global Reorder Table (GRT) in shared memory. Recoverable labels: (1) PS contains x; (2) check; (3); (4) y enters the BS; (5) execute; (6) a wr bounces.]
WeeFence Hardware is Suboptimal
• Requires GRT global hardware and additional messages
• GRT is hard to distribute like the directory
  – Creates coherence protocol races
  – If PS maps to multiple directory modules → turns into conventional fence
Our goal: High performance and simpler hardware
Weak Fence (wF)
• Reordering capabilities of WeeFence + no global state
PA          PB          PC
wr x        wr y        wr z
wF          wF          wF
rd y        rd z        rd x

[Figure: in step (1), y, z, and x enter the BSs; in step (2), the three writes are shown against those BSs.]

If all the fences in a Fence Group are wF → Deadlock
Insight: No Deadlock If At Least One Conventional Fence
• Strong Fence (sF): Conventional fence
• If the group has at least one sF → no deadlock or SC violation

PA          PB          PC
wr x        wr y        wr z
wF          wF          sF
rd y        rd z        rd x

[Figure: in step (1), only y and z enter a BS; in step (2), the writes complete. OKAY]
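The deadlock argument can be sketched as a cycle check on a wait-for graph. This is an illustrative model, not the hardware mechanism: it assumes that a pre-fence write to an address waits whenever another thread's wF holds that address in its BS, and that a BS is released only once its owner's pre-fence write completes. The function name `can_deadlock` and the tuple encoding are hypothetical.

```python
def can_deadlock(group):
    """group: list of (write_var, fence_kind, read_var); fence_kind is "wF" or "sF".

    Thread i waits on thread j when j uses a wF and j's post-fence read
    (held in j's BS) targets the variable that i writes before its fence.
    """
    waits_on = {
        i: [j for j, (_, kind, rd) in enumerate(group)
            if j != i and kind == "wF" and rd == wr]
        for i, (wr, _, _) in enumerate(group)
    }

    WHITE, GREY, BLACK = 0, 1, 2
    color = [WHITE] * len(group)

    def has_cycle(i):
        # Depth-first search; reaching a GREY node again means a cycle.
        color[i] = GREY
        for j in waits_on[i]:
            if color[j] == GREY or (color[j] == WHITE and has_cycle(j)):
                return True
        color[i] = BLACK
        return False

    return any(color[i] == WHITE and has_cycle(i) for i in range(len(group)))


all_weak = [("x", "wF", "y"), ("y", "wF", "z"), ("z", "wF", "x")]
one_strong = [("x", "wF", "y"), ("y", "wF", "z"), ("z", "sF", "x")]
print(can_deadlock(all_weak))    # True
print(can_deadlock(one_strong))  # False
```

Every node on a cycle must be the target of a wait edge, and only wF threads are targets, so a single sF in the group breaks the cycle in this model.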
Asymmetric Fence Group
• Contains one or more wF and one or more sF
• Much simpler HW than WeeFence: no global state
• Performance similar to (or higher than) WeeFence:
  – Insight: In a fence group, some threads are more critical than others
  – Use wF for the critical threads and sF for the others
Best fence performance-cost trade-off yet
Use of Asymmetric Fences: Work Stealing
• Cilk, TBB and other runtime schedulers
• steal() < 5%

take()            steal()
Tail = …          Head = …
fence (wF)        fence (sF)
… = Head          … = Tail
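The take()/steal() pattern can be sketched as a small deque, loosely following a THE-style protocol. This is an illustration under stated assumptions, not the authors' or Cilk's code: Python has no fence primitives, so `weak_fence` and `strong_fence` are hypothetical no-op placeholders that only mark where a wF and an sF would go, and the class name `WorkDeque` is invented for the example.

```python
import threading


def weak_fence():
    """Placeholder marking where the owner's wF would be (no-op in Python)."""


def strong_fence():
    """Placeholder marking where the thief's sF would be (no-op in Python)."""


class WorkDeque:
    def __init__(self):
        self.tasks = {}          # index -> task
        self.head = 0            # thieves steal from the head
        self.tail = 0            # the owner pushes and takes at the tail
        self.lock = threading.Lock()

    def push(self, task):
        self.tasks[self.tail] = task
        self.tail += 1

    def take(self):
        """Owner path (frequent): write Tail, fence, read Head."""
        self.tail -= 1                  # Tail = ...
        weak_fence()                    # fence (wF)
        if self.head > self.tail:       # ... = Head
            # Possible conflict with a thief: retry under the lock.
            self.tail += 1
            with self.lock:
                self.tail -= 1
                if self.head > self.tail:
                    self.tail += 1
                    return None
        return self.tasks.pop(self.tail)

    def steal(self):
        """Thief path (rare): write Head, fence, read Tail."""
        with self.lock:
            self.head += 1              # Head = ...
            strong_fence()              # fence (sF)
            if self.head > self.tail:   # ... = Tail
                self.head -= 1
                return None
            return self.tasks.pop(self.head - 1)


d = WorkDeque()
for t in ("a", "b", "c"):
    d.push(t)
print(d.take(), d.steal(), d.take(), d.take(), d.steal())
```

The design choice mirrors the slide: the owner's take() is the critical, frequent path and gets the wF, while the rare steal() keeps a conventional sF.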
Use of Asymmetric Fences: Software Transactional Memory (STM)
• Read and Write Barriers in STM:
  – Perform the requested rd/wr
  – Update STM metadata to ensure transaction serialization
• Reads are 3.5x more frequent than writes

read(M)                    write(M)
Lock(M).readers = …        Lock(M).writer = …
fence (wF)                 fence (sF)
… = Lock(M).writer         … = Lock(M).readers
Taxonomy of Asymmetric Fence Groups under TSO
Design   Fence groups                              Example
S+       Fence group only contains sFs             sF sF sF
WS+      Asymmetric group with at most one wF      wF sF sF
SW+      Any Asymmetric group                      sF wF wF
W+       Group with only wFs                       wF wF wF

They trade off hardware cost for performance
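The taxonomy can be read as a classification of a fence group by its mix of wFs and sFs. The sketch below is a minimal illustration that follows the slide's wording only; the function name `classify_group` is hypothetical, and it returns the category whose description matches the group, without modeling any hardware.

```python
def classify_group(fences):
    """Map a fence group such as ["wF", "sF", "sF"] to S+, WS+, SW+, or W+."""
    n_weak = fences.count("wF")
    n_strong = fences.count("sF")
    if not fences or n_weak + n_strong != len(fences):
        raise ValueError("a group must be a non-empty list of 'wF'/'sF'")
    if n_weak == 0:
        return "S+"    # only sFs
    if n_strong == 0:
        return "W+"    # only wFs
    # Asymmetric group: at least one wF and at least one sF.
    return "WS+" if n_weak == 1 else "SW+"


for group in (["sF", "sF", "sF"], ["wF", "sF", "sF"],
              ["sF", "wF", "wF"], ["wF", "wF", "wF"]):
    print(group, "->", classify_group(group))
```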
WS+: Asymmetric Groups with at Most one wF
• Pre-wF accesses never bounce off another processor’s BS
• Addresses and checks always at line level (as in a conventional protocol)
• False sharing may create bouncing

PA          PB          PC
wr x        wr y        wr z
wF          sF          sF
rd y        rd z        rd x

False-sharing case:
PA          PB
wr x        wr y
wF          wF
rd y’       rd x’

• Write proceeds (no SC violation possible)
• Other processor kept as sharer so that it sees future coherence activity
• Supported with the Order bit (see paper)
W+: Groups with only wFs
• Can deadlock under unfavorable timing
  – True cycle or due to false sharing
• Insight: Under TSO only, recovery is not too costly
  – Completed accesses are only reads
  – No speculative writes
• When wF reaches ROB head → checkpoint and proceed
• If HW detects bouncing and being bounced → timeout
  – Rollback
  – Wait for write buffer to drain (no deadlock again)

PA          PB          PC
wr x        wr y        wr z
wF          wF          wF
rd y        rd z        rd x
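The W+ recovery steps can be sketched as a small state machine. This is a toy model under stated assumptions: the slides give neither the timeout length nor the detection logic, so `TIMEOUT_CYCLES`, the class name `WPlusCore`, and the choice to reset the counter whenever the core is not simultaneously bouncing and being bounced are all hypothetical.

```python
from dataclasses import dataclass, field

TIMEOUT_CYCLES = 4  # hypothetical threshold; the slides do not give a value


@dataclass
class WPlusCore:
    checkpointed: bool = False
    stalled_cycles: int = 0
    log: list = field(default_factory=list)

    def wf_at_rob_head(self):
        """A wF reaches the ROB head: take a checkpoint and keep going."""
        self.checkpointed = True
        self.stalled_cycles = 0
        self.log.append("checkpoint")

    def tick(self, bouncing, being_bounced):
        """Advance one cycle; return True if this cycle triggers a rollback."""
        if not self.checkpointed:
            return False
        if bouncing and being_bounced:
            self.stalled_cycles += 1
        else:
            self.stalled_cycles = 0
        if self.stalled_cycles >= TIMEOUT_CYCLES:
            # Timeout: roll back to the checkpoint, then wait for the
            # write buffer to drain before retrying.
            self.log += ["rollback", "drain write buffer"]
            self.checkpointed = False
            self.stalled_cycles = 0
            return True
        return False


core = WPlusCore()
core.wf_at_rob_head()
fired = [core.tick(bouncing=True, being_bounced=True) for _ in range(TIMEOUT_CYCLES)]
print(fired, core.log)
```

Rollback stays cheap in this model for the reason given on the slide: under TSO the only completed post-fence accesses are reads, so there are no speculative writes to undo.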
Evaluation
• Simulations of an 8-core multicore
• Workloads:
  – 10 Cilk apps using the THE work-stealing algorithm
  – 10 Software Transactional Memory (STM) kernels
  – 6 STAMP apps that use STM
• Goal: Speed up execution
• Compare execution time of WS+ and W+ to:
  – S+: conventional fences
  – WeeFence
Per-Transaction Execution Time of STM Kernels
[Figure: per-transaction breakdown of processor cycles (Busy Time, Other Stall Time, Fence Stall Time), y-axis from 0 to 1, for S+, WS+, W+, and Wee on CounterBench, DListBench, ForestBench, HashBench, ListBench, MCASBench, ReadNWrite1Bench, ReadWriteNBench, TreeBench, TreeOverwriteBench, and the uSTM average.]
• Choice WS+ vs W+: Order bit for false sharing vs hardware timeouts
• WS+ and W+:
  – Eliminate avg. 1/2 and 2/3 of the fence stall time
  – Avg. transaction takes 24% and 35% fewer cycles
• Wee has only small fence time reductions: it turns many fences into sFs
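The cycle reductions above can be restated as speedups. The short calculation below is a sketch that assumes both percentages are measured against the same baseline; the helper name `speedup` is illustrative.

```python
def speedup(fraction_fewer_cycles):
    """Convert 'takes f fewer cycles' into a speedup factor 1 / (1 - f)."""
    return 1.0 / (1.0 - fraction_fewer_cycles)


for name, saved in (("WS+", 0.24), ("W+", 0.35)):
    print(f"{name}: {speedup(saved):.2f}x")
```

Under that assumption, 24% fewer cycles corresponds to roughly a 1.32x speedup per transaction and 35% fewer cycles to roughly 1.54x.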
Conclusions
• Asymmetric Fences for performance and simple HW:
  – Weak Fence (wF)
  – Strong Fence (sF)
• Best fence performance-cost trade-off yet
  – Simpler HW than WeeFence: no global hardware
  – Higher perf. than WeeFence: 13% (WS+) and 21% (W+) avg.
• Taxonomy of Asymmetric fence groups under TSO
• Outlined uses
• Future Work: programmability issues
Use of Asymmetric Fences: Bakery Algorithm
General pattern:
E[ownpid] = …
fence
for all pid: … = E[pid]

Example fence group on PA, PB, PC:
E[pid4] = … ; fence ; … = E[pid1]
E[pid1] = … ; fence ; … = E[pid3]
E[pid3] = … ; fence ; … = E[pid4]

Fence groups of many different sizes
Give priority to PA: wF for PA, sF for the other two
Execution Time of Cilk Apps
[Figure: execution time of Cilk apps broken down into Busy Time, Other Stall Time, and Fence Stall Time, y-axis from 0 to 1.2, for S+, WS+, W+, and Wee on bucket, cholesky, cilksort, fft, fib, heat, knapsack, lu, matmul, plu, and the CILK average.]

• WS+ and W+:
  – Eliminate most of the fence stall time
  – Similar impact as Wee, which is more expensive
• Overall execution time reduced by avg. 9%
Also in the Paper
• Description of taxonomy in detail
• Hardware implementation issues
  – Line displacements, RC…
• Outline of interesting programming issues
• Evaluation of STAMP benchmarks
• Scalability to 32 threads
How WeeFence Works
PS: Pending Set
BS: Bypass Set

PA            PB
wr x          wr y
WeeFence1     WeeFence2
rd y          rd x

[Figure: (1) PS contains x; (2) execute.]
How WFence Works
PS: Pending Set
BS: Bypass Set

PA            PB
wr x          wr y
Wfence1       Wfence2
rd y          rd x

[Figure: recoverable labels: (1), (2) PS with x, execute; (3) Table; PS with y; (4) local check, stall; (5).]
How WFence Works (II)
PS: Pending Set
BS: Bypass Set

PA            PB
wr x          wr y
Wfence1       wr x
rd y

No fence present in TSO (between PB's two writes)

[Figure: recoverable labels: (1) PS with x; (2) execute; (3) y in the BS; (4) coherence: squash or bounce; Table.]
Summary: How WFence Works
PS: Pending Set   BS: Bypass Set

PA
wr x
Wfence1
rd y

[Figure: recoverable labels: (1) PS with x; (2) check; (3); (4) y in the BS; (5) execute; (6) squash or bounce; Table.]

Hardware structures shown:
• Global Reorder Table (GRT) in shared memory (signatures)
• Register in the processor (signature)
• List of addresses in the cache