Hardware Latencies: How to flush them out

(A use case)

Steven Rostedt, Red Hat

Here’s a story, of a lovely lady...

● No, this isn’t the Brady Bunch
– Nor is it about a lovely lady
– But it probably could have been a Brady Bunch episode.

Here’s a story, of an upset customer

● Who was seeing lots of latencies on their own
● The machine wasn’t verified yet
● Real time requires not just a kernel
– Requires the entire spectrum
● Application
● Kernel
● Hardware


Verification of Hardware

● rteval
– A tool by Red Hat to stress the machine

– Measures jitter (using cyclictest)

● Was a large machine
– 40 CPUs
– For such a box, we expect no more than
● 200us jitter
– Would like less, but we are lenient with large HW

Latencies

● Seeing 500us latencies!!!!
– May not sound big to you

– But it's huge for PREEMPT_RT

● Took a while to hit that
● Was it HW? SW?

– We control the app (rteval)

– Of course I blamed the HW ;-)

– Of course the HW vendor blamed SW

The Enemy
● 500 microsecond latency

The Weapons
● Function tracing
● Latency tracers
● HW Lat detector
● Event tracing
● trace_printk()

Function Tracing

● echo function > current_tracer
● echo function_graph > current_tracer
● trace-cmd is nicer

– trace-cmd start -p function_graph

– trace-cmd stop

– trace-cmd extract

– trace-cmd report
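A minimal sketch of that trace-cmd workflow (the workload step is just a placeholder):

  trace-cmd start -p function_graph   # enable the tracer and return immediately
  # ... run the workload being investigated ...
  trace-cmd stop                      # stop recording, keep the kernel buffers
  trace-cmd extract                   # write the buffers out to trace.dat
  trace-cmd report | less             # read trace.dat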

rteval

● hackbench
● kernel builds
● cyclictest
● rteval --duration=100h

rteval

● Breaking it up
– rteval --onlyload --duration=100h

● Does not run cyclictest

– Run cyclictest separately
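A sketch of that split in practice (running the load in the background is my assumption; the flags are the ones used on these slides):

  rteval --onlyload --duration=100h &    # load only: hackbench, kernel builds
  cyclictest --numa -p95 -d0 -i100 -qm   # measurement runs separately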

cyclictest
● cyclictest --numa -p95 -d0 -i100 -qm

– numa implies -a -t -n
● a - bind a task to each CPU
● t - thread per CPU
● n - use nanosleep() not signals

– p95 - set priority to 95

– d0 - all threads run same interval

– i100 - sleep for 100 us

– q - quiet - don't show status during test

– m - mlockall memory

cyclictest
● cyclictest --numa -p95 -d0 -i100 -qm -b 200

– b 200 - break after 200 us latency

– implies running function tracer

– Stops tracer on latency too

● Function tracing adds a lot of overhead!
● cyclictest --numa -p95 -d0 -i100 -qm -b 1000

– increase breakpoint by a lot!

cyclictest
● Function tracing adds too much overhead
● cyclictest --numa -p95 -d0 -i100 -qm -b 200 -E

– E - uses event tracing instead of function

– Better with latencies

– Not as much info

cyclictest
● Limit function tracing with trace-cmd
– trace-cmd start -p function -n '*lock*'
– trace-cmd start -p function -l '*sched*'

● cyclictest --numa -p95 -d0 -i100 -qm -b 300

Latency Tracers

● wakeup-rt
– Ignore wakeup tracer

● preemptirqsoff
– Just ignore the:
● irqsoff
● preemptoff

Wakeup-rt
● trace-cmd start -p wakeup-rt
● Records the time of the highest priority RT task

– From wake up to schedule

● Problems
– Defaults to running the function tracer
– trace-cmd start -p wakeup-rt -d
● disables function tracing

– Not much info without functions

– trace-cmd start -p wakeup-rt -d -e all
● enables all events

Wakeup-rt

● Didn't help :-(
– Not enough info with events
– Function tracing caused latencies
● Hard to determine if the latency was real or a Heisenbug

preemptirqsoff

● trace-cmd start -p preemptirqsoff -e all -d
● Showed us issues with the scheduler
● Pointed to load balancing

– but that was a symptom not the cause

Modified cyclictest

● Changed to use function graph instead of function

● trace-cmd -p function -l load_balance

<idle>-0 [000] 60085.036305: function: load_balance
<idle>-0 [001] 60085.036305: function: load_balance
<idle>-0 [000] 60085.036306: function: load_balance

Modified cyclictest

● trace-cmd -p function_graph -l load_balance
– Much more useful

<idle>-0 [002] 60305.482591: 0.795 us | load_balance();
<idle>-0 [003] 60305.482591: 1.035 us | load_balance();
<idle>-0 [002] 60305.482593: 0.978 us | load_balance();
<idle>-0 [003] 60305.482593: 0.456 us | load_balance();

Latency without Load Balance?

● Hit a latency, and load balance wasn't called?

● PREEMPT_RT converts spinlocks to mutexes
– except for raw_spin_locks!

● trace-cmd start -p function_graph \

-l '*raw_spin_lock*'

<idle>-0 24dN.10 111214.800190: funcgraph_entry: ! 235.991 us | _raw_spin_lock_irqsave();


Graph vs Function Tracing

● graph gives you the time spent in the function
● function tracing can give you a backtrace

– trace-cmd -p function -l 'raw_spin*' --func-stack

trace-cmd-8725 [002] 148276.692827: function: _raw_spin_lock_irq
trace-cmd-8725 [002] 148276.692830: kernel_stack: <stack trace>
=> __schedule (ffffffff8146d08f)
=> schedule (ffffffff8146dd09)
=> do_nanosleep (ffffffff8146c7ec)
=> hrtimer_nanosleep (ffffffff8106eecb)
=> sys_nanosleep (ffffffff8106f00e)
=> system_call_fastpath (ffffffff81476692)

What to do?

● Keep function graph
● Add events
● All events added their own latencies

– Limit the events to trace

● trace-cmd start -p function_graph -l '*raw_spin_lock*' -e sched -e timer -e irq

Long story short

● Found the latency
● rq lock contention in pull_rt_tasks
● 30 or more CPUs tried to take the same lock
● Between cache line bouncing and locking the bus, caused a large HW latency
– but you can still blame SW

● Fixed by doing IPIs instead

Pull RT Tasks

(Diagram sequence: CPU 0, CPU 1, CPU 2 ... CPU 40, each running a cyclictest thread at prio 90. A watchdog at prio 99 briefly preempts one of them, an irq thread at prio 50 wakes up and has to wait behind a cyclictest thread, and then most of the other CPUs go <idle> while the prio 50 irq thread is still waiting to run.)

The Finding Nemo Seagull Effect!

Mine

Pull RT Tasks: The Finding Nemo Seagull Effect

(Diagram: the CPUs that just went idle all spot the waiting prio 50 irq thread and every one of them tries to pull it at once, "Mine! Mine! Mine!", each grabbing the same run queue lock.)

Pull RT Tasks: IPI to push task

(Diagram: instead of pulling, the newly idle CPUs send an IPI to the CPU where the irq thread is queued, and that CPU pushes the prio 50 irq thread over to one idle CPU.)

The End?

● Looked like we found our bug!
● Started verification process
● Told everyone things would be verified shortly

Nope!

● Passed a 12 hour run
● Failed a 24 hour run
● Let's start again!

HW Lat Detector

● Hardware latency detector
● Runs periodic stop machine

– Define a period and run

– run != period
● otherwise the system will lock up

● Spins looking for latency

HW Lat Issue

while (now - start < period) {
	tmp = timestamp();
	now = timestamp();
	/* only a latency landing between the two timestamps is seen */
	diff = now - tmp;
	if (diff > thresh)
		record();
}

HW Lat Issue

(Diagram: the same loop, annotated to show that only the window between the two timestamp() calls is actually measured; the 20% / 80% labels mark the measured versus unmeasured portions of each iteration.)

HW Lat Issue

last = 0;
while (now - start < period) {
	tmp = timestamp();
	now = timestamp();
	if (last) {
		/* catch a latency in the gap since the previous iteration */
		diff = tmp - last;
		if (diff > outer_thresh)
			record_outer();
	}
	last = now;
	diff = now - tmp;
	if (diff > thresh)
		record();
}

HW Lat Detector: Stop Machine

● Needs to run periodically
– Will lock up the system otherwise

– Has chance to miss latency again!

● Changed to a thread
– Thread takes up one of the CPUs
– Still needs to yield
● Locks up the machine otherwise
– But the yield is much smaller than the periodic gap
● More likely to measure the latency
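The thread-based detector described above later went upstream as ftrace's hwlat tracer. As a rough sketch only, on a kernel recent enough to carry it (and with tracefs mounted at /sys/kernel/tracing; the numbers are just examples), it can be driven like this:

  cd /sys/kernel/tracing
  echo 200 > tracing_thresh              # report latencies above 200 us
  echo 1000000 > hwlat_detector/window   # 1 second sampling window (in us)
  echo 500000 > hwlat_detector/width     # spin for 500 ms of each window (in us)
  echo hwlat > current_tracer
  cat trace                              # hwlat samples show up here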

HW Lat Detector: Worked!

● But not good enough
● Vendor did not trust this code ???
● Had to use their code

– Did somewhat the same thing

– In userspace

– Could easily miss latencies

trace-cmd

● trace-cmd start -p function_graph -l 'raw_spin*' -e all

● Modified cyclictest to use function graph
– Still limited to raw_spin* locks
– Won't disable the events started

● cyclictest will still stop the trace on latency

trace-cmd

ksoftirqd/33-216 [033] 55597.719935: timer_cancel: timer=0xffff88403f0ce520
ksoftirqd/33-216 [033] 55597.719935: timer_expire_entry: timer=0xffff88403f0ce520 function=delayed_work
ksoftirqd/33-216 [033] 55597.719935: funcgraph_entry: 0.069 us | _raw_spin_lock_irqsave();
ksoftirqd/33-216 [033] 55597.719936: funcgraph_entry: 0.047 us | _raw_spin_lock();
ksoftirqd/33-216 [033] 55597.719936: sched_stat_sleep: comm=kworker/33:1 pid=1222 delay=132870067
ksoftirqd/33-216 [033] 55597.719937: sched_wakeup: kworker/33:1:1222 [120] success=1 CPU:033
ksoftirqd/33-216 [033] 55597.719937: timer_expire_exit: timer=0xffff88403f0ce520
ksoftirqd/33-216 [033] 55597.719942: timer_cancel: timer=0xffff88403f0ce620
ksoftirqd/33-216 [033] 55597.719942: timer_expire_entry: timer=0xffff88403f0ce620 function=delayed_work
ksoftirqd/33-216 [033] 55597.719943: funcgraph_entry: 0.045 us | _raw_spin_lock_irqsave();
ksoftirqd/33-216 [033] 55597.719943: timer_expire_exit: timer=0xffff88403f0ce620
cyclictest-6110 [007] 55597.719955: funcgraph_entry: 0.194 us | _raw_spin_lock();
cyclictest-6110 [007] 55597.719956: funcgraph_entry: 0.175 us | _raw_spin_lock_irqsave();
cyclictest-6110 [007] 55597.719957: funcgraph_entry: 0.175 us | _raw_spin_lock_irqsave();
cyclictest-6113 [010] 55597.719957: funcgraph_entry: 2.436 us | _raw_spin_lock();
cyclictest-6110 [007] 55597.719957: funcgraph_entry: 0.203 us | _raw_spin_lock();
cyclictest-6110 [007] 55597.719958: sched_wakeup: cyclictest:6113 [4] success=1 CPU:010
cyclictest-6110 [007] 55597.719959: funcgraph_entry: 0.048 us | _raw_spin_lock_irqsave();
cyclictest-6113 [010] 55597.719960: funcgraph_entry: 0.170 us | _raw_spin_lock_irq();
cyclictest-6113 [010] 55597.719961: funcgraph_entry: 0.045 us | _raw_spin_lock_irqsave();
cyclictest-6113 [010] 55597.719962: funcgraph_entry: 0.043 us | _raw_spin_lock_irq();
cyclictest-6110 [007] 55597.719963: print: ffffffff810e5776 hit latency threshold (247 > 200)

trace-cmd

● trace-cmd report
– Lots of information

– Detailed information

– Great to analyze

● TOO MUCH INFO!
– Cannot understand it all

– Hard to see the big picture

● KernelShark
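The hand-off from trace-cmd to KernelShark is simply opening the extracted data file (trace.dat is trace-cmd's default output name):

  trace-cmd extract
  kernelshark trace.dat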

kernelshark

(Screenshots: KernelShark viewing the captured trace)

Demo

Questions?


Yeah right? Like we have time.
