66
Visualizing Performance with Flame Graphs Brendan Gregg Senior Performance Architect Jul 2017 2017 USENIX Annual Technical Conference

USENIX ATC 2017: Visualizing Performance with Flame Graphs

Embed Size (px)

Citation preview

Page 1: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Visualizing Performance with Flame Graphs

Brendan Gregg Senior Performance Architect

Jul 2017

2017 USENIX Annual Technical Conference

Page 2: USENIX ATC 2017: Visualizing Performance with Flame Graphs

VisualizeCPU-meconsumedbyallso5ware

Kernel

Java

User-level

Page 3: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Agenda

1.CPUFlamegraphs

2.FixingStacks&Symbols 3.Advancedflamegraphs

Page 4: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Takeaways

1.  InterpretCPUflamegraphs

2.  UnderstandpiHallswithstacktracesandsymbols

3.  DiscoveropportuniKesforfuturedevelopment

Page 5: USENIX ATC 2017: Visualizing Performance with Flame Graphs
Page 6: USENIX ATC 2017: Visualizing Performance with Flame Graphs

CaseStudy

Exception handling consuming CPU

Page 7: USENIX ATC 2017: Visualizing Performance with Flame Graphs

CPUPROFILINGSummary

Page 8: USENIX ATC 2017: Visualizing Performance with Flame Graphs

CPUProfiling

AB

block interrupt

on-CPU off-CPU

ABA A

BA

syscall

time

•  Record stacks at a timed interval: simple and effective –  Pros: Low (deterministic) overhead –  Cons: Coarse accuracy, but usually sufficient

stack samples: A

Page 9: USENIX ATC 2017: Visualizing Performance with Flame Graphs

StackTraces

•  Acodepathsnapshot.e.g.,fromjstack(1):

$ jstack 1819

[…]

"main" prio=10 tid=0x00007ff304009000 nid=0x7361

runnable [0x00007ff30d4f9000]

java.lang.Thread.State: RUNNABLE

at Func_abc.func_c(Func_abc.java:6)

at Func_abc.func_b(Func_abc.java:16)

at Func_abc.func_a(Func_abc.java:23)

at Func_abc.main(Func_abc.java:27)

running parent g.parent g.g.parent

Page 10: USENIX ATC 2017: Visualizing Performance with Flame Graphs

SystemProfilers•  Linux

–  perf_events(aka"perf")

•  OracleSolaris–  DTrace

•  OSX–  Instruments

•  Windows–  XPerf,WPA(whichnowhasflamegraphs!)

•  Andmanyothers…

Page 11: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Linuxperf_events•  StandardLinuxprofiler

–  Providestheperfcommand(mulK-tool)–  Usuallypkgaddedbylinux-tools-common,etc.

•  Manyeventsources:–  Timer-basedsampling–  Hardwareevents–  Tracepoints–  Dynamictracing

•  Cansamplestacksof(almost)everythingonCPU–  CanmisshardinterruptISRs,buttheseshouldbenear-zero.Theycanbemeasuredifneeded(Iwrote

myowntools).

Page 12: USENIX ATC 2017: Visualizing Performance with Flame Graphs

perfProfiling# perf record -F 99 -ag -- sleep 30[ perf record: Woken up 9 times to write data ][ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]# perf report -n -stdio[…]# Overhead Samples Command Shared Object Symbol# ........ ............ ....... ................. .............................# 20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version | --- xen_hypercall_xen_version check_events | |--44.13%-- syscall_trace_enter | tracesys | | | |--35.58%-- __GI___libc_fcntl | | | | | |--65.26%-- do_redirection_internal | | | do_redirections | | | execute_builtin_or_function | | | execute_simple_command[… ~13,000 lines truncated …]

call tree summary

Page 13: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FullperfreportOutput

Page 14: USENIX ATC 2017: Visualizing Performance with Flame Graphs

…asaFlameGraph

Page 15: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FlameGraphSummary•  VisualizesacollecKonofstacktraces

–  x-axis:alphabeKcalstacksort,tomaximizemerging–  y-axis:stackdepth–  color:random(default),oradimension

•  CurrentlymadefromPerl+SVG+JavaScript–  hBps://github.com/brendangregg/FlameGraph–  Takesinputfrommanydifferentprofilers–  MulKpled3versionsarebeingdeveloped

•  References:–  hcp://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html–  hcp://queue.acm.org/detail.cfm?id=2927301–  "TheFlameGraph"CACM,June2016

Page 16: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FlameGraphInterpretaKon

a()

b() h()

c()

d()

e() f()

g()

i()

Page 17: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FlameGraphInterpretaKon(1/3)Topedgeshowswhoisrunningon-CPU,andhowmuch(width)

a()

b() h()

c()

d()

e() f()

g()

i()

Page 18: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FlameGraphInterpretaKon(2/3)

h()

d()

e()

i()

a()

b()

c()

f()

g()

Top-downshowsancestrye.g.,fromg():

Page 19: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FlameGraphInterpretaKon(3/3)

a()

b() h()

c()

d()

e() f()

g()

i()

WidthsareproporKonaltopresenceinsamplese.g.,comparingb()toh()(incl.children)

Page 20: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Mixed-ModeFlameGraphs•  Hues:

–  green==JIT(eg,Java)–  aqua==inlined

•  ifincluded

–  red==user-level*–  orange==kernel–  yellow==C++

•  Intensity:–  Randomizedto

differenKateframes–  Orhashedon

funcKonname

Java JVM (C++)

Kernel Mixed-Mode

C

*newpaleceusesredforkernelmodulestoo

Page 21: USENIX ATC 2017: Visualizing Performance with Flame Graphs

DifferenKalFlameGraphs•  Hues:

–  red==moresamples–  blue==lesssamples

•  Intensity:–  Degreeofdifference

•  Comparestwoprofiles•  Canshowother

metrics:e.g.,CPI•  Othertypesexist

–  flamegraphdiff

Differential

more less

Page 22: USENIX ATC 2017: Visualizing Performance with Flame Graphs

IcicleGraph

top (leaf) merge

Page 23: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FlameGraphSearch•  Color:magenta

toshowmatchedframes

search button

Page 24: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FlameCharts

•  Flame charts: x-axis is time •  Flame graphs: x-axis is population (maximize merging)

•  Final note: these are useful, but are not flame graphs

fromChromedevtools

Page 25: USENIX ATC 2017: Visualizing Performance with Flame Graphs

STACKTRACINGPiHallsandfixes

Page 26: USENIX ATC 2017: Visualizing Performance with Flame Graphs

BrokenStackTracesareCommonBecause:

A.  ProfilersuseframepointerwalkingbydefaultB.  Compilersreusetheframepointerregisterasageneralpurposeregister:a

(usuallyverysmall)performanceopKmizaKon.

# perf record –F 99 –a –g – sleep 30# perf script[…]java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map)

java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown])[…]

Page 27: USENIX ATC 2017: Visualizing Performance with Flame Graphs

…asaFlameGraph

Broken Java stacks (missing frame pointer)

Page 28: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FixingStackWalkingA.  Framepointer-based

–  FixbydisablingthatcompileropKmizaKon:gcc's-fno-omit-frame-pointer–  Pros:simple,supportedbymanytools–  Cons:mightcostalicleextraCPU

B.  Debuginfo(DWARF)walking–  Cons:costsdiskspace,andnotsupportedbyallprofilers.EvenpossiblewithJIT?

C.  JITrunKmewalkers–  Pros:includemoreinternals,suchasinlinedframes–  Cons:limitedtoapplicaKoninternals-nokernel

D. Lastbranchrecord

Page 29: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FixingJavaStackTraces# perf script[…]java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) 7fd30184def8 [unknown] (/tmp/perf-8131.map) 7fd30174f544 [unknown] (/tmp/perf-8131.map) 7fd30175d3a8 [unknown] (/tmp/perf-8131.map) 7fd30166d51c [unknown] (/tmp/perf-8131.map) 7fd301750f34 [unknown] (/tmp/perf-8131.map) 7fd3016c2280 [unknown] (/tmp/perf-8131.map) 7fd301b02ec0 [unknown] (/tmp/perf-8131.map) 7fd3016f9888 [unknown] (/tmp/perf-8131.map) 7fd3016ece04 [unknown] (/tmp/perf-8131.map) 7fd30177783c [unknown] (/tmp/perf-8131.map) 7fd301600aa8 [unknown] (/tmp/perf-8131.map) 7fd301a4484c [unknown] (/tmp/perf-8131.map) 7fd3010072e0 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd3010004e7 [unknown] (/tmp/perf-8131.map) 7fd3171df76a JavaCalls::call_helper(JavaValue*,… 7fd3171dce44 JavaCalls::call_virtual(JavaValue*… 7fd3171dd43a JavaCalls::call_virtual(JavaValue*… 7fd31721b6ce thread_entry(JavaThread*, Thread*)… 7fd3175389e0 JavaThread::thread_main_inner() (/… 7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf… 7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/… 7fd317a7e182 start_thread (/lib/x86_64-linux-gn…

# perf script[…]java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/…

java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/… a2d53351ff7da603 [unknown] ([unkn…[…]

IprototypedJVMframepointers.OraclerewroteitandaddedittoJavaas-XX:+PreserveFramePointer(JDK8u60b19)

Page 30: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FixedStacksFlameGraph

Java stacks (but no symbols, yet)

Page 31: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Inlining

•  Manyframesmaybemissing(inlined)–  FlamegraphmaysKllmakeenoughsense

•  Inliningcanoqenbebetuned–  e.g.Java's-XX:-Inlinetodisable,butcanbe80%slower–  Java's-XX:MaxInlineSizeand-XX:InlineSmallCodecanbetuned

alicletorevealmoreframes:canevenimproveperformance!

•  RunKmescanun-inlineondemand–  SothatexcepKonstacktracesmakesense–  e.g.Java'sperf-map-agentcanun-inline(unfoldallopKon)

Page 32: USENIX ATC 2017: Visualizing Performance with Flame Graphs

StackDepth•  perfhada127framelimit•  NowtunableinLinux4.8

–  sysctl-wkernel.perf_event_max_stack=512–  ThanksArnaldoCarvalhodeMelo!

AJavamicroservicewithastackdepth

of>900brokenstacks

perf_event_max_stack=1024

Page 33: USENIX ATC 2017: Visualizing Performance with Flame Graphs

SYMBOLSFixing

Page 34: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FixingNaKveSymbolsA.  Adda-dbgsympackage,ifavailableB.  Recompilefromsource

Page 35: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FixingJITSymbols(Java,Node.js,…)•  Just-in-KmerunKmesdon'thaveapre-compiledsymboltable•  SoLinuxperflooksforanexternallyprovidedJITsymbol

file:/tmp/perf-PID.map

•  ThiscanbecreatedbyrunKmes;eg,Java'sperf-map-agent

# perf scriptFailed to open /tmp/perf-8131.map, continuing without symbols[…]java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map)[…]

Page 36: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Java Mixed-Mode Flame Graph

FixedStacks&Symbols

Java JVM

Kernel

GC

Page 37: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Stacks&Symbols(zoom)

Page 38: USENIX ATC 2017: Visualizing Performance with Flame Graphs

SymbolChurn•  ForJITrunKmes,symbolscanchangeduringaprofile•  Symbolsmaybemistranslatedbyperf'smapsnapshot•  SoluKons:

A.  Takeabefore&aqersnapshot,andcompareB.  perf'snewsupportforKmestampedsymbollogs

Page 39: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Containers•  perfcan'tfindanysymbolsources

–  Unlessyoucopythemintothehost

•  I'mtesKngKristerJohansen'sfix,hopefullyforLinux4.13–  lkml:"[PATCHKp/perf/core0/7]namespacetracingimprovements"

Page 40: USENIX ATC 2017: Visualizing Performance with Flame Graphs

INSTRUCTIONSForLinux

Page 41: USENIX ATC 2017: Visualizing Performance with Flame Graphs

LinuxCPUFlameGraphsLinux2.6+,viaperf.dataandperfscript:

Linux4.5+canusefoldedoutput

–  SkipstheCPU-costlystackcollapse-perf.plstep;see:hcp://www.brendangregg.com/blog/2016-04-30/linux-perf-folded.html

Linux4.9+,viaBPF:

–  Mostefficient:noperf.datafile,summarizesin-kernel

git clone --depth 1 https://github.com/brendangregg/FlameGraphcd FlameGraphperf record -F 99 -a –g -- sleep 30perf script | ./stackcollapse-perf.pl |./flamegraph.pl > perf.svg

git clone --depth 1 https://github.com/brendangregg/FlameGraphgit clone --depth 1 https://github.com/iovisor/bcc./bcc/tools/profile.py -dF 99 30 | ./FlameGraph/flamegraph.pl > perf.svg

Page 42: USENIX ATC 2017: Visualizing Performance with Flame Graphs

perf record

perf script

capturestacks

writetext

stackcollapse-perf.pl

flamegraph.pl

perf.data

writesamples

readssamples

foldedoutput

perf record

perf report –g folded

capturestacks

foldedreport

awk

flamegraph.pl

perf.data

writesamples

readssamples

foldedoutput

Linux4.5countstacks(BPF)

foldedoutput

flamegraph.pl

profile.py

Linux4.9

LinuxProfilingOpKmizaKonsLinux2.6

Page 43: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Language/RunKmeInstrucKons•  Eachmayhavespecialstack/symbolinstrucKons

–  Java,Node.js,Python,Ruby,C++,Go,…

•  I'mdocumenKngsomeon:–  hcp://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html–  AlsotryanInternetsearch

Page 44: USENIX ATC 2017: Visualizing Performance with Flame Graphs

GUIAutomaKon

Flame Graphs

Eg,NeHlixVector(self-serviceUI):

Shouldbeopensourced;youmayalsobuild/buyyourown

Page 45: USENIX ATC 2017: Visualizing Performance with Flame Graphs

ADVANCEDFLAMEGRAPHSFutureWork

Page 46: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FlamegraphscanbegeneratedforstacktracesfromanyLinuxeventsource

Page 47: USENIX ATC 2017: Visualizing Performance with Flame Graphs

PageFaults•  Showwhattriggeredmainmemory(resident)togrow:

•  "fault"as(physical)mainmemoryisallocatedon-demand,whenavirtualpageisfirstpopulated

•  Lowoverheadtooltosolvesometypesofmemoryleak

# perf record -e page-faults -p PID -g -- sleep 120

RES column in top(1) grows because

Page 48: USENIX ATC 2017: Visualizing Performance with Flame Graphs

OtherMemorySources

hcp://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html

Page 49: USENIX ATC 2017: Visualizing Performance with Flame Graphs

ContextSwitches•  ShowwhyJavablockedandstoppedrunningon-CPU:

•  IdenKfieslocks,I/O,sleeps–  Ifcodepathshouldn'tblockandlooksrandom,it'saninvoluntarycontextswitch.Icouldfilterthese,butyoushould

havesolvedthembeforehand(CPUload).

•  e.g.,wasusedtounderstandframeworkdifferences:

# perf record -e context-switches -p PID -g -- sleep 5

vs

rxNetty Tomcat

futex

sys_poll

epoll futex

Page 50: USENIX ATC 2017: Visualizing Performance with Flame Graphs

DiskI/ORequests•  ShowswhoissueddiskI/O(syncreads&writes):

•  e.g.:pagefaultsinGC?ThisJVMhasswappedout!:# perf record -e block:block_rq_insert -a -g -- sleep 60

GC

Page 51: USENIX ATC 2017: Visualizing Performance with Flame Graphs

TCPEvents•  TCPtransmit,usingdynamictracing:

•  Note:canbehighoverheadforhighpacketrates–  Forthecurrentperftrace,dump,post-processcycle

•  CanalsotraceTCPconnect&accept–  Lowerfrequency,thereforeloweroverhead

•  TCPreceiveisasync–  Couldtraceviasocketread

# perf probe tcp_sendmsg# perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace > out.stacks# perf probe --del tcp_sendmsg

TCP sends

Page 52: USENIX ATC 2017: Visualizing Performance with Flame Graphs

CPUCacheMisses•  Inthisexample,samplingviaLastLevelCacheloads:•  -cisthecount(samples

oncepercount)•  UseotherCPUcountersto

samplehits,misses,stalls

# perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso > out.stacks

Page 53: USENIX ATC 2017: Visualizing Performance with Flame Graphs

CPIFlameGraph•  CyclesPerInstrucKon

–  red==instrucKonheavy–  blue==cycleheavy

(likelymemorystallcycles)

zoomed:

Page 54: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Off-CPUAnalysis

Off-CPUanalysisisthestudyofblockingstates,orthecode-path(stacktrace)thatledtothesestates

Page 55: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Off-CPUTimeFlameGraph

Moreinfohcp://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html

Stack depth Off-CPU time

Page 56: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Off-CPUTime(zoomed):tar(1)

file read from disk

directory read from disk

Onlyshowingkernelstacksinthisexample

pipe write path read from disk

fstat from disk

Page 57: USENIX ATC 2017: Visualizing Performance with Flame Graphs

CPU+Off-CPUFlameGraphs:SeeEverything

hcp://www.brendangregg.com/flamegraphs.html

CPU

Off-CPU

Page 58: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Off-CPUTime(zoomed):gzip(1)

The off-CPU stack trace often doesn't show the root cause of latency. What is gzip blocked on?

Page 59: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Off-WakeTimeFlameGraph

UsesLinuxenhancedBPFtomergeoff-CPUandwakerstackinkernelcontext

Page 60: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Off-WakeTimeFlameGraph(zoomed)Wakertask

Wakerstack

Blockedstack

Blockedtask

StackDirecKon

Wokeup

Page 61: USENIX ATC 2017: Visualizing Performance with Flame Graphs

ChainGraphs

Walkingthechainofwakeupstackstoreachrootcause

Page 62: USENIX ATC 2017: Visualizing Performance with Flame Graphs

HotColdFlameGraphsIncludesbothCPU&Off-CPU(orchain)stacksinoneflamegraph•  However,Off-CPUKmeoqen

dominates:threadswaiKngorpolling

hcp://www.brendangregg.com/FlameGraphs/hotcoldflamegraphs.html

Page 63: USENIX ATC 2017: Visualizing Performance with Flame Graphs

FlameGraphDiff

hcps://github.com/corpaul/flamegraphdiff

Page 64: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Takeaways1.  InterpretCPUflamegraphs2.  UnderstandpiHallswithstacktracesandsymbols3.  DiscoveropportuniKesforfuturedevelopment

Page 65: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Links&References•  FlameGraphs

–  "TheFlameGraph"Communica-onsoftheACM,Vol.56,No.6(June2016)–  hcp://queue.acm.org/detail.cfm?id=2927301–  hcp://www.brendangregg.com/flamegraphs.html->hBp://www.brendangregg.com/flamegraphs.html#Updates–  hcp://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html–  hcp://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html–  hcp://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html–  hcp://techblog.neHlix.com/2015/07/java-in-flames.html–  hcp://techblog.neHlix.com/2016/04/saving-13-million-computaKonal-minutes.html–  hcp://techblog.neHlix.com/2014/11/nodejs-in-flames.html–  hcp://www.brendangregg.com/blog/2014-11-09/differenKal-flame-graphs.html–  hcp://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html–  hcp://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html–  hcp://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html–  hcp://corpaul.github.io/flamegraphdiff/

•  Linuxperf_events–  hcps://perf.wiki.kernel.org/index.php/Main_Page–  hcp://www.brendangregg.com/perf.html

•  NeHlixVector–  hcps://github.com/neHlix/vector–  hcp://techblog.neHlix.com/2015/04/introducing-vector-neHlixs-on-host.html

Page 66: USENIX ATC 2017: Visualizing Performance with Flame Graphs

Thank You

–  QuesKons?–  hcp://www.brendangregg.com–  hcp://slideshare.net/brendangregg–  [email protected]–  @brendangregg

Nexttopic:PerformanceSuperpowerswithEnhancedBPF

2017 USENIX Annual Technical Conference