UNIVERSITY OF MASSACHUSETTS, AMHERST • School of Computer Science
PREDATOR: Predictive False Sharing Detection
Tongping Liu*, Chen Tian, Ziang Hu, Emery Berger*
*University of Massachusetts Amherst; Huawei US Research Center

Page 1: Predator : Predictive False Sharing Detection

UNIVERSITY OF MASSACHUSETTS, AMHERST • School of Computer Science

PREDATOR: Predictive False Sharing Detection

Tongping Liu*, Chen Tian, Ziang Hu, Emery Berger*

*University of Massachusetts Amherst; Huawei US Research Center

Page 2: Predator : Predictive False Sharing Detection

Parallelism: Expectation is Awesome

[Figure: runtime (s), axis 0 to 90, vs. number of threads (1, 2, 4, 8); series: Expectation]

Parallel Program

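int count[8];
int W;

void increment(int S) {
  for (int in = S; in < S + W; in++)
    for (int j = 0; j < 1000000; j++)   // 1M increments per counter
      count[in]++;
}

int main(int THREADS) {                 // pseudocode: THREADS is the thread count
  W = 8 / THREADS;
  for (int i = 0; i < 8; i += W)
    spawn(increment, i);                // pseudocode: run increment(i) in a new thread
}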
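The slide's program is pseudocode (`spawn`, `int main(int THREADS)`). A minimal runnable sketch of the same idea using POSIX threads might look like the following. The function name `run_counters`, the `volatile` qualifier (added so the compiler keeps every increment as a real store), and the bounds check are assumptions for illustration, not part of the slide.

```c
#include <pthread.h>
#include <stdint.h>

#define SLOTS 8

/* 8 ints (32 bytes) usually sit in a single cache line, which is what
   makes the threads falsely share it. volatile keeps each increment
   as a real memory access. */
static volatile int count[SLOTS];
static int W;        /* counters per thread */
static long iters;   /* increments per counter */

static void *increment(void *arg) {
    int S = (int)(intptr_t)arg;          /* first counter owned by this thread */
    for (int in = S; in < S + W; in++)
        for (long j = 0; j < iters; j++)
            count[in]++;
    return NULL;
}

/* Runs the slide's program with the requested thread count and returns the
   sum of all counters, or -1 if the thread count is out of range. */
long run_counters(int threads, long n_iters) {
    pthread_t tid[SLOTS];
    int started = 0;
    long sum = 0;

    if (threads < 1 || threads > SLOTS) return -1;
    W = SLOTS / threads;
    iters = n_iters;
    for (int i = 0; i < SLOTS; i++) count[i] = 0;

    for (int i = 0; i < SLOTS; i += W) {
        if (pthread_create(&tid[started], NULL, increment,
                           (void *)(intptr_t)i) != 0)
            break;                        /* stop spawning on failure */
        started++;
    }
    for (int i = 0; i < started; i++) pthread_join(tid[i], NULL);

    for (int i = 0; i < SLOTS; i++) sum += count[i];
    return sum;
}
```

Timing `run_counters(1, 1000000)` against `run_counters(8, 1000000)` on a multicore machine is one way to look for the effect the next slide describes; the exact numbers depend on the hardware.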

Page 3: Predator : Predictive False Sharing Detection

Parallelism: Reality is Awful

[Figure: runtime (s), axis 0 to 140, vs. number of threads (1, 2, 4, 8); series: Expectation, Reality; same parallel program as on the previous slide]

False sharing slows the program by 13X

Page 4: Predator : Predictive False Sharing Detection

False Sharing in Real Applications

False sharing slows MySQL by 50%

Page 5: Predator : Predictive False Sharing Detection

Cache Line

False Sharing vs. True Sharing

Page 6: Predator : Predictive False Sharing Detection

False Sharing vs. True Sharing

[Diagram: False Sharing (Task 1, Task 2, Task 3, Task 4) vs. True Sharing (Task 1, Task 2)]
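To make the distinction concrete, here is a small sketch that classifies two memory accesses made by different threads. It uses the usual definitions: overlapping bytes mean true sharing, while disjoint bytes in the same cache line mean false sharing. The names `classify` and `enum sharing` are hypothetical, and the line size is passed in as a parameter.

```c
#include <stddef.h>
#include <stdint.h>

enum sharing { NO_SHARING, FALSE_SHARING, TRUE_SHARING };

/* Classifies accesses [a, a + a_len) and [b, b + b_len), assumed to come
   from different threads with at least one of them a write.
   Both lengths must be at least 1. */
enum sharing classify(uintptr_t a, size_t a_len,
                      uintptr_t b, size_t b_len, size_t line_size) {
    /* Overlapping byte ranges: the threads really touch the same data. */
    if (a < b + b_len && b < a + a_len)
        return TRUE_SHARING;

    /* Disjoint bytes: check whether the touched cache lines intersect. */
    uintptr_t a_first = a / line_size, a_last = (a + a_len - 1) / line_size;
    uintptr_t b_first = b / line_size, b_last = (b + b_len - 1) / line_size;
    if (a_first <= b_last && b_first <= a_last)
        return FALSE_SHARING;

    return NO_SHARING;
}
```

For example, with 64-byte lines, two 4-byte accesses at addresses 0 and 4 classify as false sharing, while accesses at 0 and 64 do not share at all.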

Page 7: Predator : Predictive False Sharing Detection

Resource Contention at Cache Line Level

Page 8: Predator : Predictive False Sharing Detection

False Sharing Causes Performance Problems

[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above Main Memory; an Invalidate message passes between the caches]

Cache line: basic unit of data transfer

Page 9: Predator : Predictive False Sharing Detection

False Sharing Causes Performance Problems

[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above Main Memory; an Invalidate message passes between the caches]

Interleaved accesses cause cache invalidations

Page 10: Predator : Predictive False Sharing Detection

False Sharing is Everywhere

me = 1; you = 1;                     // globals
me = new Foo; you = new Bar;         // heap
class X { int me; int you; };        // fields
array[me] = 12; array[you] = 13;     // array indices

Page 11: Predator : Predictive False Sharing Detection

False Sharing is Hard to Diagnose

Multiple experts worked together to diagnose MySQL scalability issue (1.5M LOC)

Page 12: Predator : Predictive False Sharing Detection

Problems of Existing Tools

• No precise information / false positives
  – WIBA’09, VEE’11, EuroSys’13, SC’13
• Accurate & precise
  – OOPSLA’11 (cannot detect read-write false sharing)

Shared problem: only detect observed false sharing

Page 13: Predator : Predictive False Sharing Detection

Detect false sharing causing performance problems

[Diagram, "False Sharing Causes Performance Problems": Task 1 on Core 1 and Task 2 on Core 2, each core with its own cache above Main Memory; an Invalidate message passes between the caches]

Interleaved accesses → Cache invalidations → Performance problems

Find cache lines with many cache invalidations

Page 14: Predator : Predictive False Sharing Detection

Find Lines with Many Invalidations

[Diagram: memory drawn as a row of cache lines]

Track cache invalidations on each cache line

Memory: Global, Heap
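One way to picture per-line tracking is a shadow array with one invalidation counter per cache line of a tracked region. The sketch below assumes 64-byte lines and a single fixed-size region. The names `set_region_base`, `record_invalidation`, and `find_hot_lines` are hypothetical and are not taken from PREDATOR.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SHIFT 6          /* assumed 64-byte cache lines */
#define TRACKED_LINES 1024    /* hypothetical tracked region of 64 KiB */

static uintptr_t region_base;                       /* start of the tracked region */
static unsigned long invalidations[TRACKED_LINES];  /* one counter per cache line */

void set_region_base(uintptr_t base) { region_base = base; }

/* Charges one invalidation to the cache line that contains addr.
   Addresses outside the tracked region are ignored. */
void record_invalidation(uintptr_t addr) {
    if (addr < region_base) return;
    size_t line = (size_t)((addr - region_base) >> LINE_SHIFT);
    if (line < TRACKED_LINES) invalidations[line]++;
}

/* Writes the indices of lines with more than `threshold` invalidations
   into out (at most max_out of them) and returns how many were written. */
size_t find_hot_lines(unsigned long threshold, size_t *out, size_t max_out) {
    size_t found = 0;
    for (size_t i = 0; i < TRACKED_LINES && found < max_out; i++)
        if (invalidations[i] > threshold) out[found++] = i;
    return found;
}
```

A real tool would need one such map for globals and another for the heap, as the slide's "Memory: Global, Heap" label suggests; the single region here only keeps the sketch short.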

Page 15: Predator : Predictive False Sharing Detection

Track Cache Invalidations

• Hardware-based approach
  – Needs hardware support
  – No portability
• Simulation-based approach
  – Needs hardware info such as cache hierarchy, cache capacity
  – Very slow
• Conservative assumptions
  – Each thread runs on a different core with its private cache.
  – Infinite cache capacity.

PREDATOR: based on memory access history of each cache line

Page 16: Predator : Predictive False Sharing Detection

Track Cache Invalidations

[Animation: a timeline of reads (r) and writes (w) by threads T1 and T2 on one cache line; the line's history table is updated after each access while the # of invalidations counts up from 0 to 3]

Each Entry: { Thread ID, Access Type }
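The slide gives only the entry format, { Thread ID, Access Type }. The sketch below shows one plausible way to count invalidations from such a history, under the conservative assumptions listed earlier (each thread on its own core, infinite cache). The two-entry table size and the update rules are assumptions made for illustration and may differ from PREDATOR's exact rules.

```c
#include <stdbool.h>

typedef struct { int tid; bool is_write; } entry_t;

typedef struct {
    entry_t e[2];                 /* assumed two-entry history per cache line */
    int n;                        /* number of valid entries: 0, 1, or 2 */
    unsigned long invalidations;  /* invalidations charged to this line */
} line_history_t;

/* Updates the history of one cache line for an access by thread tid. */
void on_access(line_history_t *h, int tid, bool is_write) {
    if (!is_write) {
        /* A read never invalidates. Remember it only if it adds a new
           thread that now holds a copy of the line. */
        if (h->n == 0 || (h->n == 1 && h->e[0].tid != tid)) {
            h->e[h->n].tid = tid;
            h->e[h->n].is_write = false;
            h->n++;
        }
        return;
    }
    /* A write invalidates the line if some other thread holds a copy.
       A full table always contains two distinct threads. */
    if (h->n == 2 || (h->n == 1 && h->e[0].tid != tid))
        h->invalidations++;

    /* After the write, only the writer holds a valid copy. */
    h->e[0].tid = tid;
    h->e[0].is_write = true;
    h->n = 1;
}
```

With these rules, alternating writes by T1 and T2 count one invalidation per write after the first, while repeated accesses by a single thread count none.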

Page 17: Predator : Predictive False Sharing Detection

PREDATOR Components

• Compiler Instrumentation: instruments every memory read/write access
• Runtime System: collects memory accesses and reports false sharing
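As a rough picture of what "instruments every memory read/write access" means, the sketch below shows a small function rewritten by hand the way an instrumenting compiler pass might rewrite it, with a call into the runtime before each load and store. The hook name `record_access` and its signature are hypothetical, and the runtime here only counts accesses.

```c
#include <stdbool.h>
#include <stddef.h>

static unsigned long n_reads, n_writes;

/* Hypothetical runtime hook. A real runtime would update per-cache-line
   and per-word metadata here instead of two global counters. */
void record_access(const void *addr, size_t size, bool is_write) {
    (void)addr;
    (void)size;
    if (is_write) n_writes++;
    else n_reads++;
}

/* Original source:  s += a[i];  a[i] = 0;
   Instrumented form: each load and store is preceded by a hook call. */
int sum_and_clear_instrumented(int *a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++) {
        record_access(&a[i], sizeof a[i], false);  /* before the load */
        s += a[i];
        record_access(&a[i], sizeof a[i], true);   /* before the store */
        a[i] = 0;
    }
    return s;
}
```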

Page 18: Predator : Predictive False Sharing Detection

Detect Problems Correctly & Precisely

• Correctly
  – No false alarms

[Diagram: False Sharing (Task 1, Task 2, Task 3, Task 4) vs. True Sharing (Task 1, Task 2)]

Track memory accesses on each word

• Precisely
  – Global variables
  – Heap objects: pinpoint the line of memory allocation
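Tracking accesses per word is what lets a tool tell true sharing from false sharing and so avoid false alarms. A minimal sketch, assuming a 64-byte line split into eight 8-byte words and non-negative thread IDs, could look like this. All names are hypothetical, and the false-sharing check is simplified: it only considers words that have a single owner.

```c
#define WORDS_PER_LINE 8     /* assumed: 64-byte line, 8-byte words */
#define NO_THREAD (-1)       /* word not accessed yet */
#define MANY_THREADS (-2)    /* word accessed by more than one thread */

typedef struct { int owner[WORDS_PER_LINE]; } word_map_t;

void word_map_init(word_map_t *m) {
    for (int w = 0; w < WORDS_PER_LINE; w++) m->owner[w] = NO_THREAD;
}

/* Records that thread tid (tid >= 0) accessed the given word of the line. */
void word_map_access(word_map_t *m, int word, int tid) {
    if (m->owner[word] == NO_THREAD) m->owner[word] = tid;
    else if (m->owner[word] != tid) m->owner[word] = MANY_THREADS;
}

/* True sharing: some word was accessed by more than one thread. */
int has_true_sharing(const word_map_t *m) {
    for (int w = 0; w < WORDS_PER_LINE; w++)
        if (m->owner[w] == MANY_THREADS) return 1;
    return 0;
}

/* False sharing candidate: two different words, each with a single owner,
   are owned by different threads. */
int has_false_sharing(const word_map_t *m) {
    int first = NO_THREAD;
    for (int w = 0; w < WORDS_PER_LINE; w++) {
        int o = m->owner[w];
        if (o < 0) continue;               /* untouched or multi-owner word */
        if (first == NO_THREAD) first = o;
        else if (o != first) return 1;
    }
    return 0;
}
```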

Page 19: Predator : Predictive False Sharing Detection

PREDATOR’s Report

Page 20: Predator : Predictive False Sharing Detection

Why do we need prediction?

Page 21: Predator : Predictive False Sharing Detection

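Necessity of False Sharing Prediction

[Diagram: the accesses of Thread 1 and Thread 2 shown against three cache line layouts: Cache line 1 / Cache line 2, Cache line 1 / Cache line 2 again, and a single Cache line 1; two of the layouts are labeled False Sharing]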

Page 22: Predator : Predictive False Sharing Detection

Properties Affecting False Sharing Occurrence

• Change of memory layout
  – 32-bit platform vs. 64-bit platform
  – Different memory allocator
  – Different compiler or optimization
  – Different allocation order by changing the code, e.g., printf
• Run on hardware with different cache line size

Page 23: Predator : Predictive False Sharing Detection

Example of False Sharing Sensitivity

[Diagram: an object placed in Memory at Offset = 0, Offset = 8, ..., Offset = 56; colors represent threads]

Cache line size = 64 bytes
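The effect of the starting offset can be checked with simple arithmetic. The sketch below asks whether two adjacent objects, each owned by a different thread, touch a common cache line when the first one starts at a given offset from a line boundary. The object size of 64 bytes used in the example is an assumption, and the function names are hypothetical.

```c
#include <stddef.h>

/* Index of the cache line that contains byte position pos. */
size_t line_of(size_t pos, size_t line_size) { return pos / line_size; }

/* Two adjacent objects of obj_size bytes (obj_size >= 1): the first starts
   at `offset`, the second right after it. Returns 1 if the last byte of
   the first object and the first byte of the second object fall into the
   same cache line. */
int neighbors_share_line(size_t offset, size_t obj_size, size_t line_size) {
    size_t last_of_first = offset + obj_size - 1;
    size_t first_of_second = offset + obj_size;
    return line_of(last_of_first, line_size) ==
           line_of(first_of_second, line_size);
}
```

With 64-byte objects and 64-byte lines, offset 0 gives each object its own line, while any other multiple of 8 up to 56 makes neighbors overlap in a line. Sharing a line is necessary but not sufficient for a slowdown: whether it hurts also depends on which bytes the threads actually access often.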

Page 24: Predator : Predictive False Sharing Detection

Example of False Sharing Sensitivity

[Figure: runtime (seconds), axis 0 to 6, for Offset = 0, 8, 16, 24, 32, 40, 48, 56]

PREDATOR predicts false sharing problems without occurrence

Page 25: Predator : Predictive False Sharing Detection

Prediction Based on Virtual Cache Lines

[Diagram: the accesses of Thread 1 and Thread 2 shown against three layouts: Real case (Cache line 1, Cache line 2), Prediction 1 (Virtual cache line 1, Virtual cache line 2; False Sharing), Prediction 2 (Virtual cache line 1; False Sharing)]

Page 26: Predator : Predictive False Sharing Detection

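Track Invalidations on Virtual Cache Lines

[Diagram: accesses X and Y at distance d; the tracked virtual line extends (sz-d)/2 before X and (sz-d)/2 after Y; the other virtual lines are non-tracked]

Condition: d < sz (the cache line size) && (X, Y) from different threads && one of them is a write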
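The geometry of the tracked virtual line follows directly from the diagram labels: it is sz bytes long and leaves (sz-d)/2 bytes before X and about the same after Y. A small sketch of that computation, with the hypothetical names `vline_t` and `tracked_virtual_line`, is shown below. Integer division rounds down, so when sz-d is odd the extra byte ends up after Y.

```c
#include <stdint.h>

typedef struct { uintptr_t start, end; } vline_t;   /* half-open [start, end) */

/* Computes the tracked virtual line for accesses at addresses x <= y with
   d = y - x < sz. The caller must ensure x >= (sz - d) / 2 so that the
   start address does not wrap around. */
vline_t tracked_virtual_line(uintptr_t x, uintptr_t y, uintptr_t sz) {
    uintptr_t d = y - x;
    vline_t v;
    v.start = x - (sz - d) / 2;   /* same slack before X ... */
    v.end = v.start + sz;         /* ... and after Y, sz bytes in total */
    return v;
}
```

For X at address 100, Y at 120, and sz = 64, d is 20 and the tracked virtual line is [78, 142): 22 bytes before X and 22 bytes from Y to the end.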

Page 27: Predator : Predictive False Sharing Detection

Benchmark Results

Benchmarks | Unknown Problem | Without Prediction | With Prediction | Improvement

Histogram ✔ ✔ ✔ 46%

Linear_regression ✔ 1207%

Reverse_index ✔ ✔ 0.09%

Word_count ✔ ✔ 0.14%

Streamcluster-1 ✔ ✔ ✔ 4.77%

Streamcluster-2 ✔ ✔ 7.52%

Page 28: Predator : Predictive False Sharing Detection

Real Applications Results

• MySQL
  – Problem: False sharing occurs when different threads update the shared bitmap simultaneously.
  – Performance improves 180% after fixes.
• Boost library
  – Problem: “there will be 16 spinlocks per cache line”
  – Performance improves about 100%.
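The "16 spinlocks per cache line" figure is consistent with 4-byte locks packed into a 64-byte line (64 / 4 = 16). The sketch below shows that arithmetic and one common remedy, padding each lock to a full line. The slide does not say how Boost was actually fixed, so the padded type is an illustrative assumption, as are the type names and the 64-byte line size.

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Hypothetical small lock: one int, 4 bytes on common platforms. */
typedef struct { int flag; } tiny_lock_t;

/* Same lock aligned to a cache line (C11 _Alignas), so each lock in an
   array occupies a line of its own. */
typedef struct { _Alignas(CACHE_LINE) int flag; } padded_lock_t;

/* How many locks of the given size fit into one cache line. */
size_t locks_per_line(size_t lock_size) { return CACHE_LINE / lock_size; }
```

Here `locks_per_line(sizeof(tiny_lock_t))` is 16 when `int` is 4 bytes, and `locks_per_line(sizeof(padded_lock_t))` is 1, at the cost of 16 times more memory for the lock array.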

Page 29: Predator : Predictive False Sharing Detection

Performance Overhead of PREDATOR

[Figure: Execution Time Overhead. Normalized runtime (axis 0 to 15) of Original, PREDATOR-NP, and PREDATOR for Phoenix (histogram, kmeans, linear_regression, matrix_multiply, pca, reverse_index, string_match, word_count), PARSEC (blackscholes, bodytrack, dedup, ferret, fluidanimate, streamcluster, swaptions, x264), and Real Applications (aget, Boost, Memcached, MySQL, pbzip2, pfscan), plus AVERAGE. Annotations on the chart: "2326", "5.6X"]

Page 30: Predator : Predictive False Sharing Detection

[Recap diagram: PREDATOR components (Compiler Instrumentation, Runtime System); cache invalidation between Thread 1 on Core 1 and Thread 2 on Core 2 above Main Memory; Precise report; prediction with virtual cache lines (Real case, Prediction 1, Prediction 2)]

Page 31: Predator : Predictive False Sharing Detection

Page 32: Predator : Predictive False Sharing Detection

False Sharing is Hard to Diagnose

Multiple experts worked together to diagnose MySQL scalability issue (1.5M LOC)

Page 33: Predator : Predictive False Sharing Detection

Detailed Prediction Algorithm

1. Find suspected cache lines

Page 34: Predator : Predictive False Sharing Detection

Detailed Prediction Algorithm

1. Find suspected cache lines

2. Track detailed memory accesses

Page 35: Predator : Predictive False Sharing Detection

Detailed Prediction Algorithm

1. Find suspected cache lines

2. Track detailed memory accesses

3. Predict based on hot accesses

[Diagram: hot accesses X and Y at distance d]

d < sz && (X, Y) from different threads: potential false sharing
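Step 3 can be sketched as a search over the hot accesses for pairs that satisfy the slide's condition. The version below also requires at least one of the two accesses to be a write, as the earlier virtual-line slide does. The `access_t` record and the function names are hypothetical, and a pair at distance 0 (the same address) is skipped because that would be true sharing rather than a false sharing candidate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t addr; int tid; bool is_write; } access_t;

/* The slide's condition for a pair of hot accesses X and Y. */
bool potential_false_sharing(access_t x, access_t y, uintptr_t sz) {
    uintptr_t d = x.addr < y.addr ? y.addr - x.addr : x.addr - y.addr;
    return d > 0 && d < sz           /* close enough to share a virtual line */
        && x.tid != y.tid            /* from different threads */
        && (x.is_write || y.is_write);
}

/* Counts the pairs among n hot accesses that are candidates for tracking
   on a virtual cache line. Quadratic, which is fine for a short hot list. */
size_t count_candidate_pairs(const access_t *hot, size_t n, uintptr_t sz) {
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (potential_false_sharing(hot[i], hot[j], sz)) pairs++;
    return pairs;
}
```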

Page 36: Predator : Predictive False Sharing Detection

4: Tracking Cache Invalidations on the Virtual Line

[Diagram: accesses X and Y at distance d; the tracked virtual line extends (sz-d)/2 before X and (sz-d)/2 after Y; the other virtual lines are non-tracked]