PREDATOR: Predictive False Sharing Detection

Tongping Liu*, Chen Tian, Ziang Hu, Emery Berger*. *University of Massachusetts Amherst; Huawei US Research Center.

1 UNIVERSITY OF MASSACHUSETTS, AMHERST • School of Computer Science

PREDATOR: Predictive False Sharing Detection

Tongping Liu*, Chen Tian, Ziang Hu, Emery Berger*

*University of Massachusetts Amherst; Huawei US Research Center

2

Parallelism: Expectation is Awesome

[Figure: runtime (s) versus number of threads (1, 2, 4, 8); series: Expectation]

Parallel Program

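int count[8];
int W;

void increment(int S) {
  for (int in = S; in < S + W; in++)
    for (int j = 0; j < 1000000; j++)  // 1M iterations
      count[in]++;
}

int main(int THREADS) {
  W = 8 / THREADS;
  for (int i = 0; i < 8; i += W)
    spawn(increment, i);  // pseudocode: run increment(i) in a new thread
}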

3

Parallelism: Reality is Awful

[Figure: runtime (s) versus number of threads (1, 2, 4, 8) for the parallel program; series: Expectation, Reality]

False sharing slows the program by 13X

The program is the same as on slide 2; the slide marks it with the label "False sharing".
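Why the counters in this program collide can be checked with a little address arithmetic. The sketch below is not from the talk. It assumes 4-byte `int`, 64-byte cache lines, and an array that starts on a line boundary, and the helper names `line_of` and `lines_spanned` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed line size; the deck later uses 64 bytes as its example.
constexpr std::size_t kLineSize = 64;

// Index of the cache line that holds byte address `addr`.
inline std::uintptr_t line_of(std::uintptr_t addr,
                              std::size_t line_size = kLineSize) {
    assert(line_size > 0);
    return addr / line_size;
}

// Number of distinct cache lines covered by `n` contiguous elements of
// `elem_size` bytes each, starting at byte address `base`.
inline std::size_t lines_spanned(std::uintptr_t base, std::size_t elem_size,
                                 std::size_t n,
                                 std::size_t line_size = kLineSize) {
    assert(elem_size > 0 && n > 0);
    const std::uintptr_t last_byte = base + elem_size * n - 1;
    return static_cast<std::size_t>(line_of(last_byte, line_size) -
                                    line_of(base, line_size)) + 1;
}
```

Under these assumptions `lines_spanned(0, 4, 8)` is 1: all eight counters sit on one cache line, so threads that update different counters still contend for the same line.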

4

False Sharing in Real Applications

False sharing slows MySQL by 50%

5

Cache Line

False Sharing vs. True Sharing

6

False Sharing vs. True Sharing

[Figure: a cache line accessed by Tasks 1 to 4, labeled "False Sharing"; a cache line accessed by Tasks 1 and 2, labeled "True Sharing"]

7

Resource Contention at Cache Line Level

8

False Sharing Causes Performance Problems

[Figure: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; an "Invalidate" message passes between the caches]

Cache line: basic unit of data transfer

9

False Sharing Causes Performance Problems

[Figure: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; an "Invalidate" message passes between the caches]

Interleaved accesses cause cache invalidations

10

False Sharing is Everywhere

me = 1; you = 1; // globals

me = new Foo; you = new Bar; // heap

class X { int me; int you; }; // fields

array[me] = 12; array[you] = 13; // array indices

11

False Sharing is Hard to Diagnose

Multiple experts worked together to diagnose MySQL scalability issue (1.5M LOC)

12

Problems of Existing Tools

• No precise information/false positives
– WIBA’09, VEE’11, EuroSys’13, SC’13

• Accurate & precise
– OOPSLA’11 (cannot detect read-write false sharing)

Shared problem: only detect observed false sharing

13

[Figure: Task 1 on Core 1 and Task 2 on Core 2, each core with its own cache above main memory, with an "Invalidate" message between the caches; caption: False Sharing Causes Performance Problems]

Interleaved accesses cause cache invalidations, which cause performance problems

Detect false sharing causing performance problems

Find cache lines with many cache invalidations

14

Find Lines with Many Invalidations

[Figure: memory (global, heap) drawn as a row of cache lines]

Track cache invalidations on each cache line

15

Track Cache Invalidations

• Hardware-based approach
– Needs hardware support
– No portability

• Simulation-based approach
– Needs hardware info such as cache hierarchy, cache capacity
– Very slow

• Conservative assumptions
– Each thread runs on a different core with its private cache.
– Infinite cache capacity.

PREDATOR: based on memory access history of each cache line

16

Track Cache Invalidations

[Figure: a timeline of reads (r) and writes (w) to one cache line by threads T1 and T2; a per-line history table is updated on each access while the number of invalidations rises from 0 to 3]

Each entry: {Thread ID, Access Type}
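One way to count invalidations from an access history is sketched below. It is an illustration under assumptions, not PREDATOR's actual code: the table size of two entries and the exact update rules are assumed (the slide only gives the entry layout {Thread ID, Access Type}), and the names `LineHistory` and `record_access` are hypothetical. It relies on the conservative assumptions from the previous slide: one private cache per thread and infinite capacity.

```cpp
#include <cassert>
#include <cstddef>

struct Entry {
    int tid;        // thread ID
    bool is_write;  // access type
};

// Per-cache-line state: a short access history plus an invalidation counter.
struct LineHistory {
    Entry entries[2];
    std::size_t used = 0;
    unsigned long invalidations = 0;
};

// Records one access; returns true if it is counted as an invalidation.
inline bool record_access(LineHistory& h, int tid, bool is_write) {
    assert(h.used <= 2);
    bool other_thread_present = false;
    for (std::size_t i = 0; i < h.used; ++i) {
        if (h.entries[i].tid != tid) other_thread_present = true;
    }
    if (!is_write) {
        // A read invalidates nothing. Remember it only if the table is
        // empty or it shows that a second thread now holds a copy.
        if (h.used == 0 || (h.used == 1 && other_thread_present)) {
            h.entries[h.used++] = Entry{tid, false};
        }
        return false;
    }
    // A write invalidates the copy held by any other thread in the history.
    if (other_thread_present) ++h.invalidations;
    // After the write, only the writer holds a valid copy of the line.
    h.entries[0] = Entry{tid, true};
    h.used = 1;
    return other_thread_present;
}
```

Keeping the table this small means the cost per access stays constant; the trade-off is that the history only says whether some other thread holds a copy, not how many do.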

17

PREDATOR Components

Compiler Instrumentation: instruments every memory read/write access

Runtime System: collects memory accesses and reports false sharing
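The split between the two components can be pictured as a hook that the compiler inserts before every load and store, with the runtime behind it. The sketch below is only a mental model: `Runtime`, `on_access`, `counts_for`, and `instrumented_increment` are hypothetical names, the 64-byte line size is an assumption, and the code is single-threaded for brevity, whereas a real runtime would need thread-safe bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct AccessCounts {
    unsigned long reads = 0;
    unsigned long writes = 0;
};

// Hypothetical runtime: collects accesses per cache line.
class Runtime {
public:
    explicit Runtime(std::size_t line_size = 64) : line_size_(line_size) {
        assert(line_size > 0);
    }

    // The hook an instrumentation pass would call before each access.
    void on_access(const void* addr, bool is_write) {
        AccessCounts& c = lines_[line_of(addr)];
        if (is_write) ++c.writes; else ++c.reads;
    }

    // Counts collected so far for the line that contains `addr`.
    AccessCounts counts_for(const void* addr) const {
        auto it = lines_.find(line_of(addr));
        return it == lines_.end() ? AccessCounts{} : it->second;
    }

private:
    std::uintptr_t line_of(const void* addr) const {
        return reinterpret_cast<std::uintptr_t>(addr) / line_size_;
    }
    std::size_t line_size_;
    std::unordered_map<std::uintptr_t, AccessCounts> lines_;
};

// What an instrumented `counter++` might look like: one load, one store.
inline void instrumented_increment(Runtime& rt, int& counter) {
    rt.on_access(&counter, false);
    rt.on_access(&counter, true);
    ++counter;
}
```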

18

Detect Problems Correctly & Precisely

• Correctly
– No false alarms
– Track memory accesses on each word

[Figure: False Sharing (Tasks 1 to 4 on one cache line) versus True Sharing (Tasks 1 and 2)]

• Precisely
– Global variables
– Heap objects: pinpoint the line of memory allocation
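Tracking accesses per word is what lets a detector tell true sharing from false sharing. A minimal sketch, assuming 8 words per line (64-byte lines, 8-byte words) and ignoring the read/write distinction; `LineWords`, `record`, and `classify` are hypothetical names, and a real line could show both kinds of sharing at once, which this simplification does not report.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWordsPerLine = 8;  // assumption: 64-byte line, 8-byte words
constexpr int kNoThread = -1;             // word not accessed yet
constexpr int kManyThreads = -2;          // word accessed by more than one thread

enum class Sharing { NoSharing, TrueSharing, FalseSharing };

// Per-line record of which thread touched each word.
struct LineWords {
    std::array<int, kWordsPerLine> owner;

    LineWords() { owner.fill(kNoThread); }

    void record(std::size_t word, int tid) {
        assert(word < kWordsPerLine && tid >= 0);
        if (owner[word] == kNoThread) {
            owner[word] = tid;
        } else if (owner[word] != tid) {
            owner[word] = kManyThreads;
        }
    }

    Sharing classify() const {
        int first = kNoThread;
        bool several_threads = false;
        for (int o : owner) {
            // The same word used by several threads: true sharing.
            if (o == kManyThreads) return Sharing::TrueSharing;
            if (o == kNoThread) continue;
            if (first == kNoThread) first = o;
            else if (o != first) several_threads = true;
        }
        // Several threads on the line, but each on its own words.
        return several_threads ? Sharing::FalseSharing : Sharing::NoSharing;
    }
};
```

Reporting only the `FalseSharing` case is one way to read the slide's "no false alarms": lines whose contention comes from threads really using the same data are left out.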

19

PREDATOR’s Report

20

Why do we need prediction?

21

Necessity of False Sharing Prediction

[Figure: accesses by Thread 1 and Thread 2 shown against cache lines 1 and 2 in three layouts; two of the layouts are labeled "False Sharing"]

22

Properties Affecting False Sharing Occurrence

• Change of memory layout
– 32-bit platform versus 64-bit platform
– Different memory allocator
– Different compiler or optimization
– Different allocation order by changing the code, e.g., printf

• Run on hardware with different cache line size

23

Example of False Sharing Sensitivity

[Figure: the same data placed in memory at starting offsets 0, 8, ..., 56; colors represent threads]

Cache line size = 64 bytes
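The effect of the starting offset can be explored with a small model. The sketch below assumes `n` equally sized objects laid out back to back, each used by a different thread, and counts the cache lines that hold bytes of more than one object. The function `shared_lines` and this layout are assumptions made for illustration; whether a shared line actually hurts depends on which bytes are hot, which the model ignores.

```cpp
#include <cassert>
#include <cstddef>

// Number of cache lines that contain bytes of two or more objects when `n`
// objects of `obj_size` bytes are placed contiguously starting `offset`
// bytes after a cache line boundary.
inline std::size_t shared_lines(std::size_t offset, std::size_t obj_size,
                                std::size_t n, std::size_t line_size = 64) {
    assert(obj_size > 0 && line_size > 0);
    std::size_t shared = 0;
    bool have_last = false;
    std::size_t last_line = 0;
    for (std::size_t k = 1; k < n; ++k) {
        // First byte of object k; object k-1 ends just before it.
        const std::size_t boundary = offset + k * obj_size;
        // If the object boundary is also a line boundary, nothing is shared.
        if (boundary % line_size == 0) continue;
        const std::size_t line = boundary / line_size;
        // Boundaries increase, so equal lines are adjacent; count each once.
        if (!have_last || line != last_line) {
            ++shared;
            last_line = line;
            have_last = true;
        }
    }
    return shared;
}
```

For eight 64-byte objects, `shared_lines(0, 64, 8)` is 0 while `shared_lines(8, 64, 8)` is 7: shifting the start by 8 bytes makes every neighbouring pair of objects straddle a line.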

24

Example of False Sharing Sensitivity

[Figure: runtime (seconds, axis 0 to 6) for starting offsets 0, 8, 16, 24, 32, 40, 48, and 56]

PREDATOR predicts false sharing problems that have not occurred

25

Prediction Based on Virtual Cache Lines

[Figure: accesses by Thread 1 and Thread 2 in three layouts. Real case: cache line 1 and cache line 2. Prediction 1: virtual cache line 1 and virtual cache line 2, labeled "False Sharing". Prediction 2: virtual cache line 1, labeled "False Sharing".]

26

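Track Invalidations on Virtual Cache Lines

[Figure: hot accesses X and Y at distance d; the tracked virtual line extends (sz-d)/2 before X and (sz-d)/2 after Y; the other candidates are non-tracked virtual lines]

d < sz (the cache line size) && (X, Y) from different threads && one of them is a write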
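The condition and the centered virtual line translate into a few lines of arithmetic. In this sketch `HotAccess`, `VirtualLine`, `may_falsely_share`, and `tracked_virtual_line` are hypothetical names, X and Y are assumed to be distinct words (accesses to the same word would be true sharing), and the line is given as a half-open byte range that is clamped at address 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct HotAccess {
    std::uintptr_t addr;  // address of the hot word
    int tid;              // thread that accesses it
    bool is_write;        // whether that thread writes it
};

struct VirtualLine {
    std::uintptr_t start;  // first byte of the virtual line
    std::uintptr_t end;    // one past its last byte
};

inline std::uintptr_t byte_distance(std::uintptr_t a, std::uintptr_t b) {
    return a < b ? b - a : a - b;
}

// The slide's condition: d < sz, different threads, at least one write.
inline bool may_falsely_share(const HotAccess& x, const HotAccess& y,
                              std::size_t sz) {
    const std::uintptr_t d = byte_distance(x.addr, y.addr);
    return d < sz && x.tid != y.tid && (x.is_write || y.is_write);
}

// Virtual line of size sz with equal slack (sz - d) / 2 before X and after Y.
inline VirtualLine tracked_virtual_line(std::uintptr_t x, std::uintptr_t y,
                                        std::size_t sz) {
    assert(x <= y && y - x < sz);
    const std::uintptr_t pad = (sz - (y - x)) / 2;
    const std::uintptr_t start = x >= pad ? x - pad : 0;
    return VirtualLine{start, start + sz};
}
```

With X at 100, Y at 120, and sz = 64, the slack is 22 bytes on each side, so the tracked virtual line is [78, 142).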

27

Benchmark Results

Benchmark          Unknown problem  Without prediction  With prediction  Improvement
Histogram          ✔                ✔                   ✔                46%
Linear_regression  -                -                   ✔                1207%
Reverse_index      -                ✔                   ✔                0.09%
Word_count         -                ✔                   ✔                0.14%
Streamcluster-1    ✔                ✔                   ✔                4.77%
Streamcluster-2    -                ✔                   ✔                7.52%

28

Real Applications Results

• MySQL
– Problem: false sharing occurs when different threads update the shared bitmap simultaneously.
– Performance improves 180% after fixes.

• Boost library
– Problem: “there will be 16 spinlocks per cache line”
– Performance improves about 100%.
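Sixteen spinlocks per line is what 4-byte locks on 64-byte lines would give. A common remedy for this kind of layout is to pad each lock out to a full cache line; the sketch below shows that idea and is not necessarily the fix that was applied to Boost. `PaddedSpinlock` and `locks_per_line` are hypothetical names, and the 64-byte line size is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLineSize = 64;  // assumed cache line size

// A spinlock aligned and padded so that each instance owns a whole line.
struct alignas(kLineSize) PaddedSpinlock {
    std::atomic<bool> locked{false};

    void lock() {
        // Spin until the previous value was "unlocked".
        while (locked.exchange(true, std::memory_order_acquire)) {}
    }
    bool try_lock() {
        return !locked.exchange(true, std::memory_order_acquire);
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

static_assert(sizeof(PaddedSpinlock) == kLineSize,
              "each lock should fill exactly one cache line");
static_assert(alignof(PaddedSpinlock) == kLineSize,
              "each lock should start on a cache line boundary");

// How many unpadded locks of `lock_size` bytes fit on one line.
inline std::size_t locks_per_line(std::size_t lock_size,
                                  std::size_t line_size = kLineSize) {
    assert(lock_size > 0);
    return line_size / lock_size;
}
```

The padding trades memory for isolation: an array of such locks uses 64 bytes per lock, but threads spinning on different locks no longer invalidate each other's lines.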

29

Performance Overhead of PREDATOR

[Figure: Execution Time Overhead. Normalized runtime (axis 0 to 15) for Original, PREDATOR-NP, and PREDATOR. Phoenix: histogram, kmeans, linear_regression, matrix_multiply, pca, reverse_index, string_match, word_count. PARSEC: blackscholes, bodytrack, dedup, ferret, fluidanimate, streamcluster, swaptions, x264. Real applications: aget, Boost, Memcached, MySQL, pbzip2, pfscan. AVERAGE. Annotations that survive from the chart: 2326, 5.6X]

30

[Summary figure: PREDATOR's components (Compiler Instrumentation, Runtime System); the cache invalidation diagram (Thread 1 on Core 1, Thread 2 on Core 2, "Invalidate"); the precise report; prediction on virtual cache lines (Real case, Prediction 1, Prediction 2)]

35

Detailed Prediction Algorithm

1. Find suspected cache lines

2. Track detailed memory accesses

3. Predict based on hot accesses

[Figure: hot accesses X and Y at distance d]

d < sz && (X, Y) from different threads: potential false sharing

36

4: Tracking Cache Invalidations on the Virtual Line

[Figure: X and Y at distance d; the tracked virtual line extends (sz-d)/2 before X and (sz-d)/2 after Y; the other candidates are non-tracked virtual lines]
