Asymmetric Memory Fences: Optimizing Performance & Programmability Yuelu Duan, Nima Honarmand, Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu ASPLOS, March 2015


Duan, Honarmand,Torrellas Asymmetric Memory Fences

Fence: a (Slow) Primitive for Parallelism

•  Instruction inserted by programmers or compilers
•  Prevents the compiler and HW from reordering memory accesses

[Figure: accesses wr y, rd x, and rd z around a fence.]

Until the pre-fence accesses are finished (reads retired; writes retired and drained from the write buffer), the post-fence accesses cannot be observed by another processor.

Expensive: the cost of a fence on a Xeon-based desktop is 20 to 200 cycles

There are HW proposals to eliminate the fence stall, e.g., WeeFence [ISCA13], but they need complex HW

Contributions

•  Asymmetric Fence Groups: eliminate fence stall with simple HW
   –  Weak Fence (wF): aggressive access reordering
   –  Strong Fence (sF): conventional
•  Best fence performance-cost tradeoff yet
•  Taxonomy of Asymmetric Fence Groups under TSO

Past Work: WeeFence

•  Aggressive access reordering to eliminate the pipeline stall [ISCA 2013]
•  Post-fence reads retire before the pre-fence writes have drained
   –  “Skip” the fence

Substantial gains when write misses pile up before the fence

[Figure: timeline with writes w1 and w2, fence f, and read r in the ROB and the write buffer (WB); r is executed speculatively. Legend: Write, Fence, Read.]

But… Reordering Can Cause Incorrect Execution

Initially x = y = 0

PA                 PB
A0: x = 1          B0: y = 1
    fence              fence
A1: t0 = y         B1: t1 = x

With fences: t0 = 1 or t1 = 1 or both = 1

With WeeFences (PA: wr x, WeeFence, rd y; PB: wr y, WeeFence, rd x): unchecked reordering allows the order A1, B0, B1, A0, which gives t0 = t1 = 0. This is a Sequential Consistency (SC) violation.

Solution: WeeFence stalls rds/wrs if reordering may cause a cycle
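
The litmus test above can be checked with a small exhaustive model. The sketch below is an illustration under stated assumptions, not the hardware from the talk: it models TSO as one FIFO store buffer per thread and a conventional fence that waits until the local buffer is empty. The names `outcomes` and `dekker` are hypothetical helpers introduced here.

```python
def outcomes(programs, init_mem):
    """Enumerate every final (t0, t1, ...) result under a simple TSO model."""
    results, seen = set(), set()

    def replace(tup, i, item):
        return tup[:i] + (item,) + tup[i + 1:]

    def explore(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pc == len(p) for pc, p in zip(pcs, programs)) and not any(bufs):
            results.add(regs)
            return
        for t, prog in enumerate(programs):
            # Choice 1: drain the oldest buffered store of thread t to memory.
            if bufs[t]:
                var, val = bufs[t][0]
                new_mem = tuple(sorted({**dict(mem), var: val}.items()))
                explore(pcs, replace(bufs, t, bufs[t][1:]), new_mem, regs)
            # Choice 2: execute the next instruction of thread t.
            if pcs[t] == len(prog):
                continue
            op = prog[pcs[t]]
            new_pcs = replace(pcs, t, pcs[t] + 1)
            if op[0] == "st":
                new_buf = bufs[t] + ((op[1], op[2]),)
                explore(new_pcs, replace(bufs, t, new_buf), mem, regs)
            elif op[0] == "ld":
                # Forward from the youngest matching buffered store, else read memory.
                val = dict(mem)[op[1]]
                for var, buffered in bufs[t]:
                    if var == op[1]:
                        val = buffered
                new_regs = tuple(sorted({**dict(regs), op[2]: val}.items()))
                explore(new_pcs, bufs, mem, new_regs)
            elif op[0] == "fence" and not bufs[t]:
                # A conventional fence completes only once the buffer is empty.
                explore(new_pcs, bufs, mem, regs)

    n = len(programs)
    explore((0,) * n, ((),) * n, tuple(sorted(init_mem.items())), ())
    return {tuple(v for _, v in regs) for regs in results}


def dekker(fenced):
    """Build the two-thread test: PA writes x then reads y, PB writes y then reads x."""
    mid = [("fence",)] if fenced else []
    pa = [("st", "x", 1)] + mid + [("ld", "y", "t0")]
    pb = [("st", "y", 1)] + mid + [("ld", "x", "t1")]
    return [pa, pb]
```

Calling `outcomes(dekker(True), {"x": 0, "y": 0})` lists the results that remain possible with fences; with `dekker(False)` the result t0 = t1 = 0 also appears.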

How WeeFence Works

PS: Pending Set
BS: Bypass Set

[Figure: PA executes wr x, Weefence1, rd y while PB executes wr y, Weefence2, rd x; the two fences form a fence group. A Global Reorder Table (GRT) resides in shared memory. Steps shown for PA: (1) the PS, holding x, goes to the table; (2) check; (3) reply; (4) y is recorded in the BS; (5) execute; (6) a wr bounces.]

WeeFence Hardware is Suboptimal

•  Requires GRT global hardware and additional messages
•  GRT is hard to distribute like the directory
   –  Creates coherence protocol races
   –  If PS maps to multiple directory modules → turns into conventional fence

Our goal: High performance and simpler hardware

Weak Fence (wF)

•  Reordering capabilities of WeeFence + no global state

[Figure: three processors. PA: wr x, wF, rd y. PB: wr y, wF, rd z. PC: wr z, wF, rd x. (1) Each post-fence read executes early and its address enters the local BS (y, z, and x). (2) Each pre-fence write then bounces off another processor's BS.]

If all the fences in a Fence Group are wF → Deadlock

Insight: No Deadlock If At Least One Conventional Fence

•  Strong Fence (sF): Conventional fence
•  If the group has at least one sF → no deadlock or SC violation

[Figure: PA: wr x, wF, rd y. PB: wr y, wF, rd z. PC: wr z, sF, rd x. (1) Only PA and PB record addresses (y and z) in a BS. (2) The writes proceed: OKAY.]
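
The deadlock argument can be traced with a toy model. This sketch assumes the worst-case timing from the figures: every wF lets its post-fence read run early before any write drains, and a write bounces while its address sits in another processor's BS. The function `run_group` and the ring-shaped access pattern are assumptions made for illustration, not the paper's hardware.

```python
def run_group(kinds):
    """kinds[i] is "wF" or "sF" for processor i in a ring.

    Processor i runs: wr a[i]; fence; rd a[(i + 1) % n].
    Returns True if every processor finishes, False on deadlock.
    """
    n = len(kinds)
    # A wF lets the post-fence read execute early, so its address stays in the
    # BS until the pre-fence write completes. An sF holds the read back, so
    # its BS stays empty.
    bypass = [{(i + 1) % n} if kinds[i] == "wF" else set() for i in range(n)]
    write_done = [False] * n
    progress = True
    while progress and not all(write_done):
        progress = False
        for i in range(n):
            if write_done[i]:
                continue
            # The write to a[i] bounces while a[i] is in another processor's BS.
            bounced = any(i in bypass[j] for j in range(n) if j != i)
            if not bounced:
                write_done[i] = True  # write drains, fence completes
                bypass[i] = set()     # BS is cleared once the fence completes
                progress = True
    return all(write_done)
```

With three wFs, no write can ever proceed; replacing any one of them with an sF leaves one BS empty, which lets the chain of writes unwind.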

Asymmetric Fence Group

•  Contains one or more wF and one or more sF
•  Much simpler HW than WeeFence: no global state
•  Performance similar to (or higher than) WeeFence:
   –  Insight: In a fence group, some threads are more critical than others
   –  Use wF for the critical threads and sF for others

Best fence performance-cost trade-off yet

Use of Asymmetric Fences: Work Stealing

•  Cilk, TBB and other runtime schedulers
•  steal() < 5%

take()               steal()
Tail =               Head =
fence (wF)           fence (sF)
= Head               = Tail
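
The take()/steal() pattern can be sketched as a THE-style deque. Python has no hardware fences, so `fence` below is a hypothetical no-op marker that only shows where a wF or sF would be issued; the class name `Deque` and the lock-based conflict path are assumptions for illustration, and the example does not demonstrate memory-ordering behavior.

```python
import threading


def fence(kind):
    """Hypothetical placeholder: marks where a wF or sF would be issued."""


class Deque:
    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.head = 0                  # next index a thief steals from
        self.tail = len(self.tasks)    # one past the owner's end
        self.lock = threading.Lock()

    def take(self):
        """Owner path (frequent): write Tail, fence, read Head."""
        self.tail -= 1                 # Tail = ...
        fence("wF")
        if self.head > self.tail:      # ... = Head: possible conflict
            self.tail += 1
            with self.lock:            # retry under the lock
                self.tail -= 1
                if self.head > self.tail:
                    self.tail += 1
                    return None        # deque is empty
                return self.tasks[self.tail]
        return self.tasks[self.tail]

    def steal(self):
        """Thief path (rare): write Head, fence, read Tail."""
        with self.lock:
            self.head += 1             # Head = ...
            fence("sF")
            if self.head > self.tail:  # ... = Tail: nothing to steal
                self.head -= 1
                return None
            return self.tasks[self.head - 1]
```

The owner's common path pays only for the fence, which is why the talk assigns the wF to take() and leaves the sF on the rarely executed steal().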

Use of Asymmetric Fences: Software Transactional Memory (STM)

•  Read and Write Barriers in STM:
   –  Perform the requested rd/wr
   –  Update STM metadata to ensure transaction serialization
•  Reads are 3.5x more frequent than writes

read(M)                      write(M)
Lock(M).readers =            Lock(M).writer =
fence (wF)                   fence (sF)
= Lock(M).writer             = Lock(M).readers

Taxonomy of Asymmetric Fence Groups under TSO

S+     Fence group only contains sFs              sF sF sF
WS+    Asymmetric group with at most one wF       wF sF sF
SW+    Any Asymmetric group                       sF wF wF
W+     Group with only wFs                        wF wF wF

They trade off hardware cost for performance
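
As a reading aid, the taxonomy can be written as a small classifier. This is a sketch under one assumption: a group is labeled with the first entry in the order S+, WS+, SW+, W+ whose description it fits. The function name `classify_group` is hypothetical.

```python
def classify_group(fences):
    """Return the taxonomy label for a fence group given as a list of "wF"/"sF"."""
    weak = sum(1 for f in fences if f == "wF")
    strong = sum(1 for f in fences if f == "sF")
    if not fences or weak + strong != len(fences):
        raise ValueError("a fence group needs one or more wF/sF entries")
    if weak == 0:
        return "S+"    # only conventional fences
    if strong == 0:
        return "W+"    # only weak fences
    if weak == 1:
        return "WS+"   # asymmetric group with a single wF
    return "SW+"       # any other asymmetric group
```

For example, `classify_group(["wF", "sF", "sF"])` gives the WS+ case and `classify_group(["sF", "wF", "wF"])` gives the SW+ case.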

WS+: Asymmetric Groups with at Most one wF

•  Pre-wF accesses never bounce off another processor’s BS
•  Addresses and checks are always at line level (as in a conventional protocol)
•  False sharing may create bouncing

[Figure, first group: wr x, wF, rd y; wr y, sF, rd z; wr z, sF, rd x; only the wF processor has a BS. Second group: wr x, wF, rd y’; wr y, wF, rd x’.]

Write proceeds (no SC violation possible). The other processor is kept as a sharer so that it sees future coherence activity. Supported with the Order bit (see paper).

W+: Groups with only wFs

•  Can deadlock under unfavorable timing
   –  True cycle or due to false sharing
•  Insight: Under TSO only, recovery is not too costly
   –  Completed accesses are only reads
   –  No speculative writes
•  When wF reaches ROB head → checkpoint and proceed
•  If HW detects bouncing and being bounced → timeout
   –  Rollback
   –  Wait for write buffer to drain (no deadlock again)

[Figure: wr x, wF, rd y; wr y, wF, rd z; wr z, wF, rd x; every processor has a BS.]
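
The timeout-and-rollback recovery can be sketched with a toy model of an all-wF ring. It assumes worst-case timing (all early reads execute before any write drains) and, for simplicity, that one processor at a time times out; a rollback discards the early read, which empties that processor's BS. The function `run_wplus` and its parameters are hypothetical.

```python
def run_wplus(n, max_rounds=10):
    """Ring of n processors, all using wF: wr a[i]; wF; rd a[(i + 1) % n].

    Returns (finished, rollbacks), where rollbacks lists the processors that
    timed out and rolled back to their checkpoint.
    """
    bypass = [{(i + 1) % n} for i in range(n)]  # early reads held in each BS
    write_done = [False] * n
    rollbacks = []
    for _ in range(max_rounds):
        if all(write_done):
            break
        progress = False
        for i in range(n):
            if write_done[i]:
                continue
            # The write to a[i] bounces while a[i] is in another processor's BS.
            if any(i in bypass[j] for j in range(n) if j != i):
                continue
            write_done[i] = True   # write drains, fence completes
            bypass[i] = set()
            progress = True
        if not progress:
            # Timeout: a processor that is bouncing others and being bounced
            # rolls back, dropping its early read, and waits for its write
            # buffer to drain like a conventional fence.
            victim = next(i for i in range(n) if not write_done[i] and bypass[i])
            bypass[victim] = set()
            rollbacks.append(victim)
    return all(write_done), rollbacks
```

In this model a single rollback is enough to break the cycle, after which the remaining writes drain in turn.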

Evaluation

•  Simulations of an 8-core multicore
•  Workloads:
   –  10 Cilk apps using the THE work-stealing algorithm
   –  10 Software Transactional Memory (STM) kernels
   –  6 STAMP apps that use STM
•  Goal: Speed up execution
•  Compare execution time of WS+ and W+ to:
   –  S+: conventional fences
   –  WeeFence

Per-Transaction Execution Time of STM Kernels

[Figure: Per-transaction breakdown of processor cycles (y-axis 0 to 1) for S+, WS+, W+, and Wee on CounterBench, DListBench, ForestBench, HashBench, ListBench, MCASBench, ReadNWrite1Bench, ReadWriteNBench, TreeBench, TreeOverwriteBench, and the uSTM average. Categories: Busy Time, Other Stall Time, Fence Stall Time.]

•  Choice WS+ vs W+: Order bit for false sharing vs hardware timeouts
•  WS+ and W+:
   –  Eliminate avg. 1/2 and 2/3 of the fence stall time
   –  Avg. transaction takes 24% and 35% fewer cycles
•  Wee has only small fence time reductions → turns many fences into sFs

Conclusions

•  Asymmetric Fences for performance and simple HW:
   –  Weak Fence (wF)
   –  Strong Fence (sF)
•  Best fence performance-cost trade-off yet
   –  Simpler HW than WeeFence: no global hardware
   –  Higher perf. than WeeFence: 13% (WS+) and 21% (W+) avg.
•  Taxonomy of Asymmetric fence groups under TSO
•  Outlined uses
•  Future Work: programmability issues


Use of Asymmetric Fences: Bakery Algorithm

E[ownpid] = …
fence
for all pid
    … = E[pid]

PA                 PB                 PC
E[pid4] = …        E[pid1] = …        E[pid3] = …
fence (wF)         fence (sF)         fence (sF)
… = E[pid1]        … = E[pid3]        … = E[pid4]

•  Fence groups of many different sizes
•  Give priority to PA

Execution Time of Cilk Apps

[Figure: Execution time of Cilk apps (y-axis 0 to 1.2) for S+, WS+, W+, and Wee on bucket, cholesky, cilksort, fft, fib, heat, knapsack, lu, matmul, plu, and the Cilk average. Categories: Busy Time, Other Stall Time, Fence Stall Time.]

•  WS+ and W+:
   –  Eliminate most of the fence stall time
   –  Similar impact as Wee, which is more expensive
•  Overall execution time reduced by avg. 9%

Also in the Paper

•  Description of taxonomy in detail
•  Hardware implementation issues
   –  Line displacements, RC…
•  Outline of interesting programming issues
•  Evaluation of STAMP benchmarks
•  Scalability to 32 threads

How WeeFence Works

PS: Pending Set
BS: Bypass Set

[Figure sequence, case I: PA executes wr x, Wfence1, rd y and PB executes wr y, Wfence2, rd x. Labels: (1) PS with x, (2) execute, (3) PS with y at the table, (4) local check, (5) stall.]

[Figure sequence, case II: PA executes wr x, Wfence1, rd y and PB executes wr y, wr x (no fence present in TSO). Labels: (1) PS with x, (2) execute, (3) y in the BS, (4) coherence: squash or bounce.]

Summary: How WFence Works

[Figure: PA executes wr x, Wfence1, rd y. Labels: (1) PS with x, (2) check, (3), (4) y in the BS, (5) execute, (6) squash or bounce. Implementation notes in the figure: Global Reorder Table (GRT) in shared memory (signatures); register in the processor (signature); list of addresses in the cache.]