Vitaly Slobodskoy Andrey Isakov Pavel Gerasimov

Vitaly Slobodskoy Andrey Isakov Pavel Gerasimov · 2019. 1. 9. · Vitaly Slobodskoy Andrey Isakov Pavel Gerasimov. Intel® VTune™ Amplifier • Performance profiler • Focus is


Vitaly Slobodskoy

Andrey Isakov

Pavel Gerasimov


Intel® VTune™ Amplifier

• Performance profiler

• Focus is on hardware performance monitoring capabilities

• Performance Monitoring Unit (PMU)

• Event Based Sampling (EBS)

[Figure: timeline of samples; each interrupt records the thread and instruction address]


Why is capturing call stacks a pain for performance monitoring tools?

• Heavily OS-dependent

• Tricky to capture stack segment on some operating systems

• Frame pointers make it easier to get the call stack, but all binaries (including system modules) must be compiled with frame pointers

• Stack unwinding is easily broken by a custom calling convention (depends on the compiler)

• Capturing the stack segment adds collection overhead and increases the size of the resulting trace files


Last Branch Record

• The LBR call stack is a list of From/To program address pairs that the CPU records for each call instruction in a specially designated buffer

• The TOS (Top of Stack) register points to the current stack element

• Intel® Skylake Microarchitecture introduces a third register, called “info”, which stores the branch prediction flag, TSX info, and elapsed cycle data

Call stack (innermost first): 0x699, 0x599, 0x499, 0x399, 0x299, 0x100

#   From    To      Info
0   0x100   0x200   111
1   0x299   0x300   222
2   0x399   0x400   333
3   0x499   0x500   444
4   0x599   0x600   555
5   0x699   0x700   666
6   0       0       0
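The relationship between the LBR table, the TOS register, and the call stack can be sketched in a few lines. This is a minimal illustration, not VTune code: the `LbrEntry` type and the `call_stack_from_lbr` helper are hypothetical names, and an all-zero entry is assumed to mean "unused slot", as in the table above.

```python
from typing import List, NamedTuple


class LbrEntry(NamedTuple):
    from_addr: int  # address of the call instruction
    to_addr: int    # address of the called function
    info: int       # value of the "info" register for this entry


def call_stack_from_lbr(entries: List[LbrEntry], tos: int) -> List[int]:
    """Return call-site (From) addresses, innermost frame first.

    Walks the ring buffer backwards from TOS and stops at the first
    unused (all-zero) entry or after one full lap.
    """
    stack = []
    n = len(entries)
    for step in range(n):
        entry = entries[(tos - step) % n]
        if entry.from_addr == 0 and entry.to_addr == 0:
            break
        stack.append(entry.from_addr)
    return stack


# The table from the slide; TOS is assumed to point at entry 5.
lbr = [
    LbrEntry(0x100, 0x200, 111),
    LbrEntry(0x299, 0x300, 222),
    LbrEntry(0x399, 0x400, 333),
    LbrEntry(0x499, 0x500, 444),
    LbrEntry(0x599, 0x600, 555),
    LbrEntry(0x699, 0x700, 666),
    LbrEntry(0, 0, 0),
]
stack = call_stack_from_lbr(lbr, tos=5)
print([hex(a) for a in stack])
```

With these values the walk yields 0x699 down to 0x100, the same order as the call stack listed on the slide.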


Limitations

• Up to 32 register pairs (depends on CPU architecture)

• CPU is not aware of OS thread context switches

• List of LBR registers can contain stack frames from various threads


Getting valid portion of call stack using LBR

• Using static binary analysis, it is possible to validate that the current FROM address is reachable from the previous TO address

• Heuristics have to be applied, as indirect jumps can’t be checked using static analysis alone
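One simplified way to picture the reachability check is to require that each call site lies inside the function entered by the previous call. The sketch below assumes a hypothetical symbol table of half-open `(start, end)` function ranges and hypothetical helper names. It is a stand-in for real static analysis: it accepts anything inside the same function and does not model indirect jumps or tail calls, which is where the heuristics mentioned above would come in.

```python
import bisect


def function_of(addr, functions):
    """Return the index of the (start, end) range containing addr, or None.

    `functions` must be sorted by start address; ranges are half-open.
    """
    starts = [start for start, _ in functions]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0 and addr < functions[i][1]:
        return i
    return None


def valid_tail(chain, functions):
    """Return the most recent part of `chain` that passes the check.

    `chain` is a list of (from_addr, to_addr) pairs, oldest first. An entry
    is kept only if its From address lies in the function that the previous
    entry's To address entered.
    """
    keep_from = len(chain) - 1
    while keep_from > 0:
        prev_to = chain[keep_from - 1][1]
        cur_from = chain[keep_from][0]
        func = function_of(prev_to, functions)
        if func is None or function_of(cur_from, functions) != func:
            break
        keep_from -= 1
    return chain[keep_from:]


# Hypothetical symbol table and an LBR chain whose oldest entry
# was left behind by a different thread.
functions = [(0x200, 0x300), (0x300, 0x400), (0x400, 0x500),
             (0x500, 0x600), (0x7F00, 0x8000)]
chain = [(0x7F10, 0x7F80), (0x299, 0x300), (0x399, 0x400), (0x499, 0x500)]
tail = valid_tail(chain, functions)
print([(hex(f), hex(t)) for f, t in tail])
```

Here the first entry is dropped because 0x299 is not inside the function entered at 0x7f80, so only the last three entries are treated as one thread's stack.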


Full call stacks for PMU Event Based Sampling

• Why don’t we try to stitch LBR-provided stack frames between samples?

Use MSR# and LBR INFO to identify common part

Sample 1 (TOS at entry 5)

#   From    To      Info
0   0x100   0x200   111
1   0x299   0x300   222
2   0x399   0x400   333
3   0x499   0x500   444
4   0x599   0x600   555
5   0x699   0x700   666
6   0       0       0

Sample 2 (TOS at entry 1)

#   From    To      Info
0   0x899   0x900   888
1   0x999   0xa00   999
2   0x399   0x400   333
3   0x499   0x500   444
4   0x599   0x600   550
5   0x699   0x700   660
6   0x799   0x800   770

Common part: entries 2 and 3 (identical in both samples)


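Full call stacks for PMU Event Based Sampling

Stitched LBR entries (oldest first)

Source     #   From    To      Info
Sample 1   0   0x100   0x200   111
Sample 1   1   0x299   0x300   222
Common     2   0x399   0x400   333
Common     3   0x499   0x500   444
Sample 2   4   0x599   0x600   550
Sample 2   5   0x699   0x700   660
Sample 2   6   0x799   0x800   770
Sample 2   0   0x899   0x900   888
Sample 2   1   0x999   0xa00   999

Sample 1 entries 4 (0x599 0x600 555), 5 (0x699 0x700 666), and 6 (0 0 0) are outside the common part; Sample 2 holds newer values at those indices.

Call stack (innermost first): 0x999, 0x899, 0x799, 0x699, 0x599, 0x499, 0x399, 0x299, 0x100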
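The stitching step can be sketched as follows. This is one possible reading of the slides, not the authors' implementation. It walks the newer sample back from its TOS until it meets an entry that is identical (same index, From, To, and Info) in the older sample, then continues through the older sample. The names `LbrEntry` and `stitch` are hypothetical, and the TOS values (5 and 1) are inferred from the call stacks shown.

```python
from typing import List, NamedTuple


class LbrEntry(NamedTuple):
    from_addr: int
    to_addr: int
    info: int


EMPTY = LbrEntry(0, 0, 0)


def stitch(prev: List[LbrEntry], prev_tos: int,
           cur: List[LbrEntry], cur_tos: int) -> List[int]:
    """Return call-site addresses, innermost first, stitched across samples."""
    n = len(cur)
    stack: List[int] = []
    idx = cur_tos
    # Phase 1: frames that only the newer sample knows about.
    for _ in range(n):
        entry = cur[idx]
        if entry == EMPTY:
            return stack          # reached the bottom of the newer sample
        if entry == prev[idx]:
            break                 # stitch point: same MSR index and contents
        stack.append(entry.from_addr)
        idx = (idx - 1) % n
    else:
        return stack              # no common entry, nothing to stitch
    # Phase 2: continue through the older sample from the stitch point.
    for _ in range(n):
        entry = prev[idx]
        if entry == EMPTY:
            break
        stack.append(entry.from_addr)
        idx = (idx - 1) % n
        if idx == prev_tos:
            break                 # wrapped around to the newest old entry
    return stack


sample1 = [LbrEntry(0x100, 0x200, 111), LbrEntry(0x299, 0x300, 222),
           LbrEntry(0x399, 0x400, 333), LbrEntry(0x499, 0x500, 444),
           LbrEntry(0x599, 0x600, 555), LbrEntry(0x699, 0x700, 666),
           EMPTY]
sample2 = [LbrEntry(0x899, 0x900, 888), LbrEntry(0x999, 0xA00, 999),
           LbrEntry(0x399, 0x400, 333), LbrEntry(0x499, 0x500, 444),
           LbrEntry(0x599, 0x600, 550), LbrEntry(0x699, 0x700, 660),
           LbrEntry(0x799, 0x800, 770)]
stitched = stitch(sample1, 5, sample2, 1)
print([hex(a) for a in stitched])
```

With the two samples from the slides, the result is the nine-frame stack from 0x999 down to 0x100. Comparing the Info value as well as the addresses is what separates a surviving frame from one that returned and was called again from the same site.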


Issue #1 – more than 32 calls/rets of depth between samples

• We are limited by the number of LBR registers

• With more than 32 calls/rets between two samples, we will not be able to stitch the stacks

• Suggested solutions:

• Increase sampling frequency based on related PMU events:

• BR_INST_RETIRED.CALL or similar

• ROB_MISC_EVENTS.LBR_INSERTS

[Figure: samples and LBR buffer contents over time]
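A toy model shows why too many calls or returns between samples leave nothing to stitch on. The class below is an assumption-laden simplification, not a description of the hardware: it uses a 4-entry buffer instead of 32, clears an entry on return, and uses a running counter as a stand-in for the INFO register. All names are hypothetical.

```python
class ToyLbrCallStack:
    """Simplified ring buffer: call pushes an entry, ret pops one."""

    def __init__(self, depth=32):
        self.depth = depth
        self.entries = [None] * depth
        self.tos = depth - 1  # so that the first call lands at index 0
        self.seq = 0          # stand-in for the INFO register contents

    def call(self, from_addr, to_addr):
        self.tos = (self.tos + 1) % self.depth
        self.seq += 1
        self.entries[self.tos] = (from_addr, to_addr, self.seq)

    def ret(self):
        self.entries[self.tos] = None
        self.tos = (self.tos - 1) % self.depth

    def snapshot(self):
        return list(self.entries)


def common_entries(a, b):
    """Count slots holding the same non-empty entry in both snapshots."""
    return sum(1 for x, y in zip(a, b) if x is not None and x == y)


def common_after(calls, rets=0, depth=4):
    """Take a sample at depth 3, apply rets then calls, compare samples."""
    lbr = ToyLbrCallStack(depth)
    for i in range(3):
        lbr.call(0x100 * i + 0x99, 0x100 * (i + 1))
    before = lbr.snapshot()
    for _ in range(rets):
        lbr.ret()
    for i in range(calls):
        lbr.call(0x1000 + i, 0x2000 + i)
    return common_entries(before, lbr.snapshot())


for calls, rets in [(1, 0), (2, 0), (4, 0), (3, 3)]:
    print(calls, rets, common_after(calls, rets))
```

In this model, one or two extra calls still leave shared entries, while four calls (a full lap of the buffer) or three returns followed by three calls leave none. That is the situation a higher sampling frequency on call-related events is meant to avoid.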


Issue #2 – thread migration

• Hardware keeps LBR history per CPU core, not per software thread

• Need to handle OS context switches in order to keep LBR history for the thread

• Trigger an interrupt on every user->kernel mode switch via the CPU_CLK_UNHALTED.RING0_TRANS PMU event

• Capture LBRs and TOS when a context switch occurs to retain the thread’s call stack history

• Not every kernel mode transition leads to a context switch (e.g. syscall), so filtering is required

• The event can be configured to trigger an interrupt on entry to or exit from kernel mode

[Figure: thread migration between CPU A and CPU B]
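The bookkeeping behind these bullets might look like the sketch below. It assumes the collector gets a callback on kernel entry (where it snapshots the LBRs and TOS) and on kernel exit (where it learns which thread resumes). The class, method names, and callback shape are hypothetical and only illustrate the filtering idea.

```python
class ThreadLbrHistory:
    """Keeps per-thread LBR snapshots across context switches."""

    def __init__(self):
        self._pending = {}  # cpu -> (tid, entries, tos) captured on kernel entry
        self._saved = {}    # tid -> (entries, tos) for switched-out threads

    def on_kernel_entry(self, cpu, tid, lbr_entries, tos):
        # Capture on entry: the LBRs still describe the user-mode thread.
        self._pending[cpu] = (tid, list(lbr_entries), tos)

    def on_kernel_exit(self, cpu, tid):
        """Return the saved (entries, tos) for `tid` if it was switched in."""
        pending = self._pending.pop(cpu, None)
        if pending is not None:
            prev_tid, entries, tos = pending
            if prev_tid == tid:
                return None  # syscall or interrupt, no context switch
            self._saved[prev_tid] = (entries, tos)
        return self._saved.pop(tid, None)


history = ThreadLbrHistory()

# Thread 1 makes a syscall on CPU 0: same thread resumes, snapshot is dropped.
history.on_kernel_entry(0, 1, [(0x100, 0x200, 111)], 0)
after_syscall = history.on_kernel_exit(0, 1)

# Thread 1 is switched out on CPU 0 in favor of thread 2.
history.on_kernel_entry(0, 1, [(0x100, 0x200, 111)], 0)
first_run_of_2 = history.on_kernel_exit(0, 2)

# Thread 1 later resumes on CPU 1, replacing thread 3.
history.on_kernel_entry(1, 3, [(0x500, 0x600, 5)], 0)
resumed = history.on_kernel_exit(1, 1)
print(after_syscall, first_run_of_2, resumed)
```

The snapshot returned for the resuming thread is what the next sample on that thread would be stitched against, instead of whatever the new core's LBRs currently contain.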


Issue #3 – C++ exceptions

• A C++ exception results in a long jump instead of a RET instruction, causing an imbalance between the number of CALL and RET instructions

• As a result, the LBR call stack contains duplicated entries

• Use the stack pointer to validate whether an exception occurred?


Other issues

• Low chance of getting the right call depth in case of recursion

• The INFO register value is unlikely to be unique

• LBRs are flushed on entering the C1 state

• Call stack history is lost when the core goes into a C-state

• Capturing LBRs on exit from kernel mode won’t work

• LBRs need to be captured on entry to kernel mode


FFmpeg, Intel® Core™ i7 (Skylake), 8 cores

• Running multiple FFmpeg processes in parallel

• Execution time: 29 seconds

• Collection overhead:

• Perf with call stacks: 30-31s, ~7% overhead

• LBR based call stacks with context switch detection: 31-32s, ~8% overhead
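For reference, the overhead percentages follow from the run times as (measured - baseline) / baseline. The helper name below is hypothetical, and the inputs are the ends of the ranges quoted above.

```python
def overhead_pct(baseline_s, measured_s):
    """Collection overhead as a percentage of the baseline run time."""
    return 100.0 * (measured_s - baseline_s) / baseline_s


baseline = 29
for label, measured in [("Perf, low", 30), ("Perf, high", 31),
                        ("LBR, low", 31), ("LBR, high", 32)]:
    print(label, round(overhead_pct(baseline, measured), 1))
```

Against the 29 s baseline this gives 3.4% to 6.9% for the 30-31 s range and 6.9% to 10.3% for the 31-32 s range.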

* [Stitch point frame] is intentionally added to highlight how the algorithm works


Mozilla Firefox, Intel® Kaby Lake, 8 cores

• Video streaming in the browser


Python, Intel® Core™ i7 (Skylake), 8 cores

• Single process running a Python linter on a subdirectory

• Application time: 60s

• Collection overhead:

• Perf with call stacks: 61s, ~2% overhead

• LBR based call stacks with context switch detection: 62s, ~3% overhead


Outcome

• Hardware-assisted call stacks are a promising solution with a number of advantages compared with traditional methods:

• No Operating System dependency, can be applied on RTOSs and custom embedded OSs

• No calling convention and minimal compiler dependency

• Proper handling of C++ exceptions requires additional research