Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

ROME Workshop, August 23, 2016

DEALING WITH LAYERS OF OBFUSCATIONIN PSEUDO-UNIFORM MEMORY

Robert Kuban, Mark Simon Schops, Jorg Nolte,Randolf Rotta [email protected]

1Research supported by German BMBF grant 01IH13003.

PROBLEM: MEMORY LATENCY ON INTEL XEON PHI KNC

Example: Measuring avg. time is unstable between restarts

Affects: micro-benchmarks,algorithm tuning,developer’s sanity. . .also application performance?

⇒ Outline

1. Causes?

2. Solutions?

3. Is it worthwhile?

2

OUTLINE

1. Causes?

2. Solutions?


4. Conclusions

1 ·Causes? 3

CAUSES: MULTIPLE PERFORMANCE BOTTLENECKS

1. compute bound

2. memory throughput:streaming, matrix alg.

3. memory latency:key-value stores, graphs

4. coherence latency:synchronisation variable

5. coherence throughput:many sync. variables

corecache

coherence directory

memory

corecache

1

2,3

4,5

1 ·Causes? 4

HW SOLUTION: STRIPING TO MAXIMISE THROUGHPUT

1. striping over memory channels, banks, and coherence directories2. past: NUMA throughput bottlenecks⇒ mostly local striping3. many-cores: no throughput bottlenecks but larger network

corecache

directory

memory memory

directory

memory memory

corecache

corecache

directory

memory memory

corecache

1 ·Causes? 5



corecache

directory

memory memory

directory

memory memory

corecache

corecache

directory

memory memory

corecache

1 ·Causes? 5



corecache

directory

memory memory

directory

memory memory

corecache

corecache

directory

memory memory

corecache

directory

1 ·Causes? 5

INTEL XEON PHI KNC IN DETAIL

4 threadsL2 cache

directory

2xmemory

ctrl

4 threadsL2 cache

directory

57 - 61 cores

64 directories

bi-directional ring network

2xmemory

ctrl

2xmemory

ctrl

2xmemory

ctrl

memory striping by (PhysAddr/62)&0xF.1

avg. remote L2 read ≈ 240 cycles, contention >16 threads.2

some lines near to memory, up to 28% app. speedup possible.3

1John McCalpin: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/5861382Ramos et al: Modeling communication in cache-coherent SMP systems: A case-study with Xeon Phi.3Balazs Gerofi et al: Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs

1 ·Causes? 6

https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/586138

OUTLINE

1. Causes?

2. Solutions?


4. Conclusions

2 ·Solutions? 7

REVERSE ENGINEERING KNC’S DIRECTORY STRIPING

measure: fetch line currently owned by neighbour L2two cores, two lines: one for measurement, other for coordinationminimum RDTSC cycles, MyThOS kernel as bare-metal env.

core

coreL2 cacheL2 cache

L2 cache L2 cache

directory

directory

1 2

3

1

2

3

core

core

2 ·Solutions? 8

RESULTS: PSEUDO-RANDOMLY SCATTERED≈140 cycles best case vs. ≈400 cycles worst case

100

200

300

400

0 256 512 768 1024cache line

late

ncy

from

cor

e 0

to 1

2 ·Solutions? 9

RESULTS: RECONSTRUCTED MAPPING OF LINES TO DIRECTORIESEnables quick initialisation without measurements

100

200

300

400

0 16 32 48 64tag directory

late

ncy

from

cor

e 0

to 1

2 ·Solutions? 10

IMPLICATIONS

Support in the MyThOS kernel

per page: base address for line 7→ directory

per node: balanced mapping for directory 7→ nearby core

kernel objects can allocate local lines for sync. vars.

Application challenges

avoid >16 threads accessing same line

co-locate dependent tasks

squeeze synchronisation into cache lines

no easy migration after allocation

2 ·Solutions? 11

OUTLINE

1. Causes?

2. Solutions?


4. Conclusions

3 · Is it worthwhile? 12

PING-PONG BENCHMARK: BUSY POLLING, THEN WRITE

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●●

●

●●

●

●●

●

●●

●

●●

●

●●

●

●●

●

●●

●

●

●

●

●●

●

●●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●●

●

●●

●

●●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●

●

●

●●●●●●●●●●●●●●●●

●

●

●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●

●

●

●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●

●

●●●●●●●●●●●●●●●●

●

●

●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●

●

●

●

●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●

●

●

●

●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●

●●

●●●●●●●●●●●●●●●●

●

●

●

●●●●●●●●●●●●●●●●

●

●

●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●

●

●

●

●●●●●●●●●●●●●●●●

●

●

●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●

●

●

●

●●●●●●●●●●●●●●●●

●

●

●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●

●●

●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●●

●

●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●●

●

●

●●●●●●●●●●●●●●●●

●

●

●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●

●

●

●●●●●●●●●●●●●●●●

●

●

●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●

●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●

●

●

●

●●●●●●●●●●●●●●●●

●

●

●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●●

●

●●●●●●●●●●●●●●●●

●

●

●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●

●●

●

●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●

●

●

●●

●

●●●●●

●

●●●●●●●●●●●●●●●●●●

●

●●●

●

●●●●●

●

●●●●●

●

●

●

●

●

●●●

●

●●●●

●

●●●●●●●

●

●●

●

●●

●

●

●

●●●●

●

●●

●

●●●●●●

●

●●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●●

●

●●●

●●●●●

●

●●●

●

●●●●

●●●●

●

●●●●●

●●●●●●

●

●●●●●●●●

●

●●●●●●●●

●●

●

●

●●

●●●●●

●

●

●●●

●

●●●●●●●

●

●●●●●●●●

●

●

●

●●●●●●●●●

●

●●●●●●●●

●

●

●

●●●●●●●●

●

●

●●●●●●●●

●

●●●●●●●●●●●

●

●●

●

●●●●●●

●

●●

●

●●●●

●

●

●

●

●●●

●

●●

●

●●●●●●

●

●●

●

●●●●

●

●●●●●

●

●●●●●●●●

●

●●●●●●●●●●●

●

●●●

●●●●●

●

●●

●●●●●●●●

●

●

●●●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●

●

●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●

●

●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●●●●●●

●

●

●●●●●●●●●●●●●●●●●●

●

●

●

●●●●●●●●●

●

●

●●●●●●●●

●

●

●●●●

●●●

●●●

●

●●●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●●

●

●●●●●●●●

●●●

●

●●●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●

●

●●●●●●

●

●

●

●

●●●●

●

●●

●

●●●

●

●●●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●●

●

●●●

●

●●●●●●●

●

●●●●●●●●

●

●●●●●●●●●●

●

●

●●

●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●

●

●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●

●

●●●●●●

●

●●●●●●●●●●●

●

●●●●●●

●●

●

●●●●●●●

●●●

●

●

●●●●●●●●

●

●●●●●●●●●

●

●

●

●●

●●●●●●

●

●●●●●●●●●

●

●

●

●●●●●●●●

●

●●●

●●

●●●●●●

●

●●●●●●

●●

●

●

●●●●●●●●●●

●

●●●●●●●●

●

●●●●●●●●●●

●

●

●●●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●●●●●●●

●

●●●●●

●

●●●●●

●

●●

●

●●●●●●

●

●●●●

●

●●●

●●

●

●

●●●●●●●●

●

●●●●●●●●

●●

●

●

●●●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●

●

●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●

●

●●●●●

●

●

●●

●

●●

●●

●

●●

●

●●

●●●●●●●●●

●

●●●●

●

●●●●●●

●

●●●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●●

●

●●●●●●●●●●

●

●

●●●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●●

●

●

●●

●●●●●●●●

●

●●●●●●●●

●

●●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●●

●

●●●●●●●●●●●

●

●●●●●●●

●

●

●●

●

●

●●●

●

●●

●

●●●

●

●●

●

●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●

●

●●●●●●

●

●●

●

●●●●

●

●●

●

●●●

●

●●

●

●●●●●●

●

●●

●

●●●●

●

●●

●

●

●

●●●

●

●●

●

●●

●

●

●●

●●●●●●

●

●●

●

●●●●●●●●●●●●●●●●

read

200

400

600

800

1000

1200

1400

1600

1 15 30distance between the cores

mea

n la

tenc

y [n

s]

Placement: worst best


PING-PONG BENCHMARK: TIMES DON’T ADD UP

core

3:copy

1:read2

5:invalidationbroadcast

L2 cachecore

4:rfo

6:ack

L2 cache

directory


PING-PONG BENCHMARK: AVOID INVALIDATION BROADCASTS!

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

atomic fetch−and−add

200

400

600

800

1000

1200

1400

1600

1 15 30distance between the cores

mea

n la

tenc

y [n

s]

Placement: worst best


INTEL XEON PHI KNL: DOES IT APPLY?

DDR4 DDR4

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

1

2

3

modes: all2all, quadrant, sub-numa; as memory or L3 cache

benchmarks4: quadrant > all2all > sub-numa

⇒ memory + directory striping persistssmaller latency? overhead of Y-X crossing?

4Carlos Rosales: A Comparative Study of Application Performance and Scalability on the Intel Knights Landing Processor


OUTLINE

1. Causes?

2. Solutions?


4. Conclusions

4 ·Conclusions 17

CONCLUSIONS

memory striping 6= directory striping

good for throughput-bound computations,bad for latency- and synchronisation-bound computations

Intel KNC: pseudo-uniform

up to 3x synchronisation latencybut avoiding broadcasts and contention equally important

benchmarks: average over multiple random allocations

Future. . .

MyThOS: evaluate impact on in-kernel synchronisation

Intel KNL: latency and contention benchmarks

HW: dedicated memory/network for synchronisation !?

4 ·Conclusions 18

OUTLINE

5. Appendix

5 ·Appendix 19

RESULTS: UNEVEN MAPPING, DEPENDS ON ENABLED CORES!

0%

1%

2%

3%

4%

0 20 40 60core

near

est c

ache

line

s

5 ·Appendix 20

READING FROM MEMORY: LATENCY FROM CORE 0

100

200

300

400

0 1024 2048 3072 4096 5120 6144 7168 8192cache line

late

ncy

from

cor

e 0

5 ·Appendix 21

READING FROM MEMORY: LATENCY FROM BEST CORE

100

200

300

400

0 1024 2048 3072 4096 5120 6144 7168 8192cache line

late

ncy

from

bes

t cor

e

5 ·Appendix 22

“PSEUDO-UNIFORM” MEMORY ARCHITECTURES

Good for throughput bound computations

HW maximises average throughput over large data sets,average latency hidden by prefetching & many threads

⇒ no need for data partitioning and placement,can focus on computation balance

Bad for latency and synchronisation bound computations

most synchronisation variables are very small,prefetching does not help

average latency does not apply,permanent overhead depending on placement

5 ·Appendix 23

Documents

Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory