25
ROME Workshop, August 23, 2016 DEALING WITH LAYERS OF OBFUSCATION IN PSEUDO-UNIFORM MEMORY Robert Kuban, Mark Simon Sch ¨ ops, J ¨ org Nolte, Randolf Rotta [email protected] 1 Research supported by German BMBF grant 01IH13003.

Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

ROME Workshop, August 23, 2016

DEALING WITH LAYERS OF OBFUSCATIONIN PSEUDO-UNIFORM MEMORY

Robert Kuban, Mark Simon Schops, Jorg Nolte,Randolf Rotta [email protected]

1Research supported by German BMBF grant 01IH13003.

Page 2: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

PROBLEM: MEMORY LATENCY ON INTEL XEON PHI KNC

Example: Measuring avg. time is unstable between restarts

Affects: micro-benchmarks,algorithm tuning,developer’s sanity. . .also application performance?

⇒ Outline

1. Causes?

2. Solutions?

3. Is it worthwhile?

2

Page 3: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

OUTLINE

1. Causes?

2. Solutions?

3. Is it worthwhile?

4. Conclusions

1 ·Causes? 3

Page 4: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

CAUSES: MULTIPLE PERFORMANCE BOTTLENECKS

1. compute bound

2. memory throughput:streaming, matrix alg.

3. memory latency:key-value stores, graphs

4. coherence latency:synchronisation variable

5. coherence throughput:many sync. variables

corecache

coherence directory

memory

corecache

1

2,3

4,5

1 ·Causes? 4

Page 5: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

HW SOLUTION: STRIPING TO MAXIMISE THROUGHPUT

1. striping over memory channels, banks, and coherence directories2. past: NUMA throughput bottlenecks⇒ mostly local striping3. many-cores: no throughput bottlenecks but larger network

corecache

directory

memory memory

directory

memory memory

corecache

corecache

directory

memory memory

corecache

1 ·Causes? 5

Page 6: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

HW SOLUTION: STRIPING TO MAXIMISE THROUGHPUT

1. striping over memory channels, banks, and coherence directories2. past: NUMA throughput bottlenecks⇒ mostly local striping3. many-cores: no throughput bottlenecks but larger network

corecache

directory

memory memory

directory

memory memory

corecache

corecache

directory

memory memory

corecache

1 ·Causes? 5

Page 7: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

HW SOLUTION: STRIPING TO MAXIMISE THROUGHPUT

1. striping over memory channels, banks, and coherence directories2. past: NUMA throughput bottlenecks⇒ mostly local striping3. many-cores: no throughput bottlenecks but larger network

corecache

directory

memory memory

directory

memory memory

corecache

corecache

directory

memory memory

corecache

directory

1 ·Causes? 5

Page 8: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

INTEL XEON PHI KNC IN DETAIL

4 threadsL2 cache

directory

2xmemory

ctrl

4 threadsL2 cache

directory

57 - 61 cores

64 directories

bi-directional ring network

2xmemory

ctrl

2xmemory

ctrl

2xmemory

ctrl

memory striping by (PhysAddr/62)&0xF.1

avg. remote L2 read ≈ 240 cycles, contention >16 threads.2

some lines near to memory, up to 28% app. speedup possible.3

1John McCalpin: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/5861382Ramos et al: Modeling communication in cache-coherent SMP systems: A case-study with Xeon Phi.3Balazs Gerofi et al: Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs

1 ·Causes? 6

Page 9: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

OUTLINE

1. Causes?

2. Solutions?

3. Is it worthwhile?

4. Conclusions

2 ·Solutions? 7

Page 10: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

REVERSE ENGINEERING KNC’S DIRECTORY STRIPING

measure: fetch line currently owned by neighbour L2two cores, two lines: one for measurement, other for coordinationminimum RDTSC cycles, MyThOS kernel as bare-metal env.

core

coreL2 cacheL2 cache

L2 cache L2 cache

directory

directory

1 2

3

1

2

3

core

core

2 ·Solutions? 8

Page 11: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

RESULTS: PSEUDO-RANDOMLY SCATTERED≈140 cycles best case vs. ≈400 cycles worst case

100

200

300

400

0 256 512 768 1024cache line

late

ncy

from

cor

e 0

to 1

2 ·Solutions? 9

Page 12: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

RESULTS: RECONSTRUCTED MAPPING OF LINES TO DIRECTORIESEnables quick initialisation without measurements

100

200

300

400

0 16 32 48 64tag directory

late

ncy

from

cor

e 0

to 1

2 ·Solutions? 10

Page 13: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

IMPLICATIONS

Support in the MyThOS kernel

per page: base address for line 7→ directory

per node: balanced mapping for directory 7→ nearby core

kernel objects can allocate local lines for sync. vars.

Application challenges

avoid >16 threads accessing same line

co-locate dependent tasks

squeeze synchronisation into cache lines

no easy migration after allocation

2 ·Solutions? 11

Page 14: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

OUTLINE

1. Causes?

2. Solutions?

3. Is it worthwhile?

4. Conclusions

3 · Is it worthwhile? 12

Page 15: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

PING-PONG BENCHMARK: BUSY POLLING, THEN WRITE

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●

●●

●●●●●

●●●●●●●●●●●●●●●●●●

●●●

●●●●●

●●●●●

●●●

●●●●

●●●●●●●

●●

●●

●●●●

●●

●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●

●●●●●

●●●

●●●●

●●●●

●●●●●

●●●●●●

●●●●●●●●

●●●●●●●●

●●

●●

●●●●●

●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●

●●●●●●

●●

●●●●

●●●

●●

●●●●●●

●●

●●●●

●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●

●●●●●

●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●

●●●●

●●

●●●

●●

●●●●●●

●●

●●●●

●●

●●●

●●

●●●●●●

●●

●●●●

●●

●●●

●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●

●●●●

●●●

●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●

●●●●●●●●

●●

●●●●

●●

●●●

●●●●●●●●

●●

●●●●

●●

●●●

●●

●●●●●●

●●●●

●●

●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●

●●

●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●

●●●●

●●

●●●

●●

●●●●●●

●●

●●●●

●●

●●●

●●

●●●●●●

●●●●●●●●●●●

●●●●●●

●●

●●●●●●●

●●●

●●●●●●●●

●●●●●●●●●

●●

●●●●●●

●●●●●●●●●

●●●●●●●●

●●●

●●

●●●●●●

●●●●●●

●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●

●●●●

●●

●●●

●●●●●●●●

●●●●●

●●●●●

●●

●●●●●●

●●●●

●●●

●●

●●●●●●●●

●●●●●●●●

●●

●●●●●●●●

●●

●●●●

●●

●●●

●●

●●●●●●

●●

●●●●

●●

●●●

●●

●●●●●

●●

●●

●●

●●

●●

●●●●●●●●●

●●●●

●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●

●●●

●●

●●●

●●

●●●●●●

●●

●●●●

●●

●●●

●●

●●●●●●

●●

●●●●

●●

●●●

●●

●●●●●●

●●

●●●●

●●

●●●

●●

●●

●●

●●●●●●

●●

●●●●●●●●●●●●●●●●

read

200

400

600

800

1000

1200

1400

1600

1 15 30distance between the cores

mea

n la

tenc

y [n

s]

Placement: worst best

3 · Is it worthwhile? 13

Page 16: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

PING-PONG BENCHMARK: TIMES DON’T ADD UP

core

3:copy

1:read2

5:invalidationbroadcast

L2 cachecore

4:rfo

6:ack

L2 cache

directory

3 · Is it worthwhile? 14

Page 17: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

PING-PONG BENCHMARK: AVOID INVALIDATION BROADCASTS!

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

atomic fetch−and−add

200

400

600

800

1000

1200

1400

1600

1 15 30distance between the cores

mea

n la

tenc

y [n

s]

Placement: worst best

3 · Is it worthwhile? 15

Page 18: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

INTEL XEON PHI KNL: DOES IT APPLY?

DDR4 DDR4

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

1

2

3

modes: all2all, quadrant, sub-numa; as memory or L3 cache

benchmarks4: quadrant > all2all > sub-numa

⇒ memory + directory striping persistssmaller latency? overhead of Y-X crossing?

4Carlos Rosales: A Comparative Study of Application Performance and Scalability on the Intel Knights Landing Processor

3 · Is it worthwhile? 16

Page 19: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

OUTLINE

1. Causes?

2. Solutions?

3. Is it worthwhile?

4. Conclusions

4 ·Conclusions 17

Page 20: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

CONCLUSIONS

memory striping 6= directory striping

good for throughput-bound computations,bad for latency- and synchronisation-bound computations

Intel KNC: pseudo-uniform

up to 3x synchronisation latencybut avoiding broadcasts and contention equally important

benchmarks: average over multiple random allocations

Future. . .

MyThOS: evaluate impact on in-kernel synchronisation

Intel KNL: latency and contention benchmarks

HW: dedicated memory/network for synchronisation !?

4 ·Conclusions 18

Page 21: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

OUTLINE

5. Appendix

5 ·Appendix 19

Page 22: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

RESULTS: UNEVEN MAPPING, DEPENDS ON ENABLED CORES!

0%

1%

2%

3%

4%

0 20 40 60core

near

est c

ache

line

s

5 ·Appendix 20

Page 23: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

READING FROM MEMORY: LATENCY FROM CORE 0

100

200

300

400

0 1024 2048 3072 4096 5120 6144 7168 8192cache line

late

ncy

from

cor

e 0

5 ·Appendix 21

Page 24: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

READING FROM MEMORY: LATENCY FROM BEST CORE

100

200

300

400

0 1024 2048 3072 4096 5120 6144 7168 8192cache line

late

ncy

from

bes

t cor

e

5 ·Appendix 22

Page 25: Dealing with Layers of Obfuscation - in pseudo-Uniform Memory · 2019. 5. 22. · INTEL XEON PHI KNC IN DETAIL 4 threads L2 cache directory 2x memory ctrl 4 threads L2 cache directory

“PSEUDO-UNIFORM” MEMORY ARCHITECTURES

Good for throughput bound computations

HW maximises average throughput over large data sets,average latency hidden by prefetching & many threads

⇒ no need for data partitioning and placement,can focus on computation balance

Bad for latency and synchronisation bound computations

most synchronisation variables are very small,prefetching does not help

average latency does not apply,permanent overhead depending on placement

5 ·Appendix 23