Ftrace How-to version 0.4 - Latinoware

Red Hat

Daniel ’bristot’ de Oliveira

October 21, 2016

Who am I?

I am Daniel :-)

I am from Brazil! But I’m Italian too...

That means 9 FIFA World Cup championships o/

My father did not allow me to become a truck driver... So I started to study:

B.Sc. Computer Science 2009

M.Sc. Automation Engineering 2014

Ph.D. Automation Engineering 2019

Before Red Hat: 5 years with embedded Linux

at Red Hat: SEG on SBR-Kernel: Real-time and performance

What is trace?

Run-time information

Hey... function foo, called bar...

Hey... function bar returned in 2 us

Hey... the code crossed here, and the var X is 10

Can be enabled/disabled at runtime

Low overhead... mainly when disabled (this is really important)

Generates a *lot* of data!

A dozen lines of trace per microsecond, per CPU!

Trace techniques

Static trace - Compiled in the code

Trace of functions - In the function calls

Dynamic trace - Added dynamically

Kernel tracing

Trace techniques

Static trace - tracepoints

Trace of functions - ftrace

Dynamic trace - kprobes

Ftrace provides an interface for all three techniques

Go!

Please, boot your RHEL7/Fedora VMs

Or run it on your own machine! It is safe :-)

Ftrace’s interface

Ftrace is built into the kernel

Accessible via debugfs

echo to set

cat to get

On Fedora and on RHEL7 it is mounted by default at:

/sys/kernel/debug/

On RHEL6:

mount -t debugfs debugfs /sys/kernel/debug/

Ftrace’s interface is at /sys/kernel/debug/tracing/
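
Before running the examples it helps to check where the interface actually is. A minimal sketch, assuming a POSIX shell; the helper name `find_tracing_dir` is hypothetical, and the `/sys/kernel/tracing` path in the usage note is an assumption (newer kernels expose tracefs there), not something shown in these slides.

```shell
# Print the first candidate directory that looks like ftrace's interface.
# A directory qualifies if it contains the current_tracer file.
find_tracing_dir() {
    for candidate in "$@"; do
        if [ -f "$candidate/current_tracer" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Nothing found: debugfs is probably not mounted (see the mount command above).
    return 1
}
```

Usage, as root: `cd "$(find_tracing_dir /sys/kernel/debug/tracing /sys/kernel/tracing)"`.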

Ftrace’s interface

[root@btt-rhel7 ~]# cd /sys/kernel/debug/tracing/

[root@btt-rhel7 tracing]# ls

available_events max_graph_depth stack_trace_filter

available_filter_functions options trace

available_tracers per_cpu trace_clock

buffer_size_kb printk_formats trace_marker

buffer_total_size_kb README trace_options

current_tracer saved_cmdlines trace_pipe

dyn_ftrace_total_info set_event trace_stat

enabled_functions set_ftrace_filter tracing_cpumask

events set_ftrace_notrace tracing_max_latency

free_buffer set_ftrace_pid tracing_on

function_profile_enabled set_graph_function tracing_thresh

instances snapshot uprobe_events

kprobe_events stack_max_size uprobe_profile

kprobe_profile stack_trace

Starting with the function tracer

Trace of kernel functions

Only kernel and only functions

Only kernel functions - no user-space

No macros and no inline functions

Basically: how does it work?

gcc -pg adds a call to mcount at the beginning of each function

mcount receives the address of the caller and of the caller's caller

mcount calls* the function tracer’s function

which saves the information in the trace buffer

Default question: WOW, so does it mean a lot of overhead?

No: only a small overhead when enabled, and a "nop" when disabled:

When disabled, all mcount calls are turned into nops.

This lecture by Steven explains how it works: video.linux.com/videos/removing-stop-machine-from-the-tracing-infrastructure

Basic ftrace’s interface

available_tracers

cat: show available tracers

current_tracer

cat: show current tracer

echo: set the current tracer

trace

cat: print the trace buffer

echo: clean the trace buffer

tracing_on

echo 1: turn the trace on

echo 0: turn the trace off
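
The files above can be combined into a small guard that only sets a tracer the kernel actually offers. This is a sketch, not part of ftrace: the function name `set_tracer` and its arguments are hypothetical, and it assumes a POSIX shell running as root.

```shell
# set_tracer DIR TRACER
# Write TRACER to DIR/current_tracer only if DIR/available_tracers lists it.
set_tracer() {
    dir=$1
    tracer=$2
    # available_tracers is a single line of space-separated names.
    for t in $(cat "$dir/available_tracers"); do
        if [ "$t" = "$tracer" ]; then
            echo "$tracer" > "$dir/current_tracer"
            return
        fi
    done
    echo "tracer '$tracer' is not listed in available_tracers" >&2
    return 1
}
```

Usage: `set_tracer /sys/kernel/debug/tracing function`.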

Basic ftrace’s interface

[root@btt-rhel7 tracing]# cat available_tracers

blk function_graph wakeup_rt wakeup function nop

[root@btt-rhel7 tracing]# cat current_tracer

nop

[root@btt-rhel7 tracing]# cat trace

# tracer: nop

#

# entries-in-buffer/entries-written: 0/0 #P:4

#

# _-----=> irqs-off

# / _----=> need-resched

# | / _---=> hardirq/softirq

# || / _--=> preempt-depth

# ||| / delay

# TASK-PID CPU# |||| TIMESTAMP FUNCTION

# | | | |||| | |

Using function tracer

[root@btt-rhel7 tracing]# echo function > current_tracer

[root@btt-rhel7 tracing]# echo 1 > tracing_on

[root@btt-rhel7 tracing]# head -15 trace

# tracer: function

#

# entries-in-buffer/entries-written: 71715/71715 #P:4

#

# _-----=> irqs-off

# / _----=> need-resched

# | / _---=> hardirq/softirq

# || / _--=> preempt-depth

# ||| / delay

# TASK-PID CPU# |||| TIMESTAMP FUNCTION

# | | | |||| | |

bash-2274 [002] .... 2553.416814: mutex_unlock <-rb_simple_write

bash-2274 [002] .... 2553.416816: __fsnotify_parent <-vfs_write

bash-2274 [002] .... 2553.416817: fsnotify <-vfs_write

bash-2274 [002] .... 2553.416817: __srcu_read_lock <-fsnotify

Stopping the trace

[root@btt-rhel7 tracing]# echo 0 > tracing_on

[root@btt-rhel7 tracing]# echo nop > current_tracer

[root@btt-rhel7 tracing]# echo > trace
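
The start and stop steps can be wrapped around a single command, so the buffer holds mostly what happened while that command ran. A minimal sketch under assumptions: the helper name `trace_run` and its argument order are hypothetical, and only the files shown in the slides are used.

```shell
# trace_run DIR TRACER OUTFILE COMMAND [ARGS...]
# Enable TRACER, run COMMAND, stop tracing, save the buffer to OUTFILE,
# then put the interface back into its idle state.
trace_run() {
    dir=$1
    tracer=$2
    out=$3
    shift 3
    echo "$tracer" > "$dir/current_tracer" || return 1
    echo 1 > "$dir/tracing_on"
    "$@"
    status=$?
    echo 0 > "$dir/tracing_on"
    cat "$dir/trace" > "$out"        # save before cleaning the buffer
    echo nop > "$dir/current_tracer"
    echo > "$dir/trace"              # clean the trace buffer
    return $status
}
```

Usage: `trace_run /sys/kernel/debug/tracing function /tmp/ls.trace ls /`. The trace is system-wide, so other tasks will still show up unless a filter is set.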

Graph tracer

It traces function calls

But also function returns

So, can I get the execution time of a function? YES!

But it has a cost: it is more expensive than the function tracer

But not that much

Function graph tracer

[root@btt-rhel7 tracing]# echo function_graph > current_tracer

[root@btt-rhel7 tracing]# echo 1 > tracing_on

[root@btt-rhel7 tracing]# head -20 trace

# tracer: function_graph

#

# CPU DURATION FUNCTION CALLS

# | | | | | | |

3) | tick_do_update_jiffies64() {

3) 0.045 us | _raw_spin_lock();

3) | do_timer() {

3) | update_wall_time() {

3) 0.046 us | _raw_spin_lock_irqsave();

3) 0.047 us | _raw_spin_unlock_irqrestore();

3) 0.617 us | }

3) 0.040 us | calc_global_load();

3) 1.138 us | }

3) 0.042 us | _raw_spin_unlock();

3) 1.938 us | }
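
Since every leaf call carries its duration, the output is easy to post-process. The sketch below lists leaf calls (lines ending in `();`) that took at least a given number of microseconds. The helper name `graph_leaf_over` is hypothetical, and it assumes the default function_graph line layout shown above.

```shell
# graph_leaf_over MIN_US   (reads function_graph output on stdin)
# Print "duration function" for each leaf call lasting at least MIN_US.
graph_leaf_over() {
    awk -v min="$1" -F'|' '
        /^#/ { next }                       # skip the header
        $2 ~ /\(\);[ \t]*$/ {               # leaf calls only
            n = split($1, f, " ")
            dur = ""
            # The duration is the field right before the "us" unit.
            for (i = 2; i <= n; i++)
                if (f[i] == "us") dur = f[i - 1]
            if (dur != "" && dur + 0 >= min + 0) {
                name = $2
                gsub(/^[ \t]+|\(\);[ \t]*$/, "", name)
                printf "%s %s\n", dur, name
            }
        }'
}
```

Usage: `head -1000 trace | graph_leaf_over 1.0`.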

Jumping to Tracepoints

Trace points in the kernel’s code

Low overhead, mainly when disabled

Each one runs a callback that writes to ftrace’s buffer

They are also known as trace events (e.g. in perf)

Organized by subsystem

subsystem:tracepoint_name

Basic tracepoint’s interface

available_events

cat: show available events

set_event

cat: show enabled events

echo: enable/clean events

Basic tracepoint’s interface

[root@btt-rhel7 tracing]# cat available_events | grep irq_handler

irq:irq_handler_exit

irq:irq_handler_entry

[root@btt-rhel7 tracing]# cat available_events | wc -l

1200

[root@btt-rhel7 tracing]# echo irq:irq_handler_exit > set_event

[root@btt-rhel7 tracing]# cat set_event

irq:irq_handler_exit

[root@btt-rhel7 tracing]# echo irq:irq_handler_entry >> set_event

[root@btt-rhel7 tracing]# cat available_events | grep sched_wakeup >> set_event

[root@btt-rhel7 tracing]# cat set_event

irq:irq_handler_exit

irq:irq_handler_entry

sched:sched_wakeup_new

sched:sched_wakeup

[root@btt-rhel7 tracing]# echo > set_event

[root@btt-rhel7 tracing]# cat set_event
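
The `cat available_events | grep ... >> set_event` idiom above can be turned into a helper that also reports what it enabled. A sketch only: the function name `enable_events_matching` is hypothetical, and it assumes a POSIX shell running as root.

```shell
# enable_events_matching DIR PATTERN
# Append every event in DIR/available_events that matches PATTERN to
# DIR/set_event, and print the events that were enabled.
enable_events_matching() {
    dir=$1
    pattern=$2
    # grep fails (and so does the helper) when nothing matches.
    matches=$(grep -- "$pattern" "$dir/available_events") || return 1
    printf '%s\n' "$matches" >> "$dir/set_event"
    printf '%s\n' "$matches"
}
```

Usage: `enable_events_matching /sys/kernel/debug/tracing sched_wakeup`.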

Tracepoints output

[root@btt-rhel7 tracing]# cat available_events | grep irq_handler > set_event

[root@btt-rhel7 tracing]# head -20 trace

# tracer: nop

#

# entries-in-buffer/entries-written: 150/150 #P:4

#

# _-----=> irqs-off

# / _----=> need-resched

# | / _---=> hardirq/softirq

# || / _--=> preempt-depth

# ||| / delay

# TASK-PID CPU# |||| TIMESTAMP FUNCTION

# | | | |||| | |

<idle>-0 [001] d.h. 3623.817286: irq_handler_entry: irq=42 name=virtio0-input.0

<idle>-0 [001] d.h. 3623.817290: irq_handler_exit: irq=42 ret=handled

<idle>-0 [003] d.h. 3624.175584: irq_handler_entry: irq=14 name=ata_piix

<idle>-0 [003] d.h. 3624.175681: irq_handler_exit: irq=14 ret=handled

<idle>-0 [003] d.h. 3624.175689: irq_handler_entry: irq=14 name=ata_piix

<idle>-0 [003] d.h. 3624.175706: irq_handler_exit: irq=14 ret=handled

<idle>-0 [001] d.h. 3624.186418: irq_handler_entry: irq=42 name=virtio0-input.0

<idle>-0 [001] d.h. 3624.186421: irq_handler_exit: irq=42 ret=handled

<idle>-0 [001] d.h. 3625.264161: irq_handler_entry: irq=42 name=virtio0-input.0

Ftrace and tracepoints - together is better

<idle>-0 [002] .N.. 173.728450: schedule_preempt_disabled <-cpu_startup_entry

<idle>-0 [002] .N.. 173.728450: __schedule <-schedule_preempt_disabled

<idle>-0 [002] .N.. 173.728450: rcu_note_context_switch <-__schedule

<idle>-0 [002] .N.. 173.728450: _raw_spin_lock_irq <-__schedule

<idle>-0 [002] dN.. 173.728451: pre_schedule_idle <-__schedule

<idle>-0 [002] dN.. 173.728451: idle_exit_fair <-pre_schedule_idle

<idle>-0 [002] dN.. 173.728451: put_prev_task_idle <-__schedule

<idle>-0 [002] dN.. 173.728451: pick_next_task_fair <-__schedule

<idle>-0 [002] dN.. 173.728451: clear_buddies <-pick_next_task_fair

<idle>-0 [002] dN.. 173.728452: __dequeue_entity <-pick_next_task_fair

<idle>-0 [002] d... 173.728452: sched_switch: prev_comm=swapper/2 prev_pid=0

prev_prio=120 prev_state=R ==> next_comm=virt-what next_pid=2325 next_prio=120

grep-2325 [002] d... 173.728454: finish_task_switch <-__schedule

grep-2325 [002] .... 173.728455: __mmdrop <-finish_task_switch

grep-2325 [002] .... 173.728455: pgd_free <-__mmdrop

grep-2325 [002] .... 173.728455: _raw_spin_lock <-pgd_free

grep-2325 [002] .... 173.728455: _raw_spin_unlock <-pgd_free

grep-2325 [002] .... 173.728455: free_pages <-pgd_free

grep-2325 [002] .... 173.728456: free_pages.part.63 <-free_pages

grep-2325 [002] .... 173.728456: __free_pages <-free_pages.part.63

But it is too much information!

Tracing all the functions is too much!

It is possible to filter the trace of functions

And it is also possible to filter tracepoints based on their data.

Let’s try it, starting with ftrace.

Ftrace’s filter interface

available_filter_functions

cat: show the functions that can be filtered

set_ftrace_filter

cat: show functions that will be traced

echo: enable/clean functions that will be traced

set_ftrace_notrace

cat: show functions that will NOT be traced

echo: enable/clean functions that will NOT be traced

set_ftrace_pid

cat: show the pid that will be traced

echo: set/clean the pid that will be traced

Filtering the trace of functions

[root@btt-rhel7 tracing]# cat available_filter_functions | wc -l

29428

[root@btt-rhel7 tracing]# echo mutex_lock > set_ftrace_filter

[root@btt-rhel7 tracing]# echo mutex_unlock >> set_ftrace_filter

[root@btt-rhel7 tracing]# cat set_ftrace_filter

mutex_unlock

mutex_lock

[root@btt-rhel7 tracing]# echo > set_ftrace_filter

[root@btt-rhel7 tracing]# cat set_ftrace_filter

#### all functions enabled ####

Filtering the trace of functions

[root@btt-rhel7 tracing]# echo mutex_lock mutex_unlock > set_ftrace_filter

[root@btt-rhel7 tracing]# echo function > current_tracer

[root@btt-rhel7 tracing]# echo 2294 > set_ftrace_pid

[root@btt-rhel7 tracing]# echo 1 > tracing_on

[root@btt-rhel7 tracing]# head -20 trace

# tracer: function

#

# entries-in-buffer/entries-written: 490/490 #P:4

#

# _-----=> irqs-off

# / _----=> need-resched

# | / _---=> hardirq/softirq

# || / _--=> preempt-depth

# ||| / delay

# TASK-PID CPU# |||| TIMESTAMP FUNCTION

# | | | |||| | |

bash-2294 [001] .... 111801.975119: mutex_unlock <-rb_simple_write

bash-2294 [001] .... 111801.975134: mutex_lock <-trace_array_put

bash-2294 [001] .... 111801.975135: mutex_unlock <-trace_array_put

bash-2294 [001] .... 111801.975360: mutex_lock <-n_tty_write

bash-2294 [001] .... 111801.975368: mutex_unlock <-n_tty_write

bash-2294 [001] .... 111801.975369: mutex_unlock <-tty_write_unlock

bash-2294 [001] .... 111801.975462: mutex_lock <-tty_ioctl
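
Even a filtered trace is easier to read as a summary. The sketch below counts how often each `function <-caller` pair appears in function tracer output. The helper name `count_callers` is hypothetical, and it assumes the default line layout shown above, where the last two fields are the function and its caller.

```shell
# count_callers   (reads function tracer output on stdin)
# Print "count function <-caller", most frequent pair first.
count_callers() {
    awk '
        /^#/ { next }                                  # skip the header
        $NF ~ /^<-/ { seen[$(NF - 1) " " $NF]++ }      # function and caller
        END { for (pair in seen) print seen[pair], pair }
    ' | sort -k1,1nr -k2
}
```

Usage: `cat trace | count_callers | head`.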

Filter and function graph tracer

[root@btt-rhel7 tracing]# echo function_graph > current_tracer

[root@btt-rhel7 tracing]# head -20 trace

# tracer: function_graph

#

# CPU DURATION FUNCTION CALLS

# | | | | | | |

2) 0.127 us | mutex_lock();

2) 0.112 us | mutex_unlock();

2) 0.055 us | mutex_unlock();

2) 0.127 us | mutex_lock();

2) 0.122 us | mutex_lock();

2) 0.129 us | mutex_lock();

2) 0.049 us | mutex_lock();

2) 0.050 us | mutex_unlock();

2) 0.206 us | mutex_lock();

2) 0.063 us | mutex_lock();

2) 0.081 us | mutex_unlock();

2) 0.063 us | mutex_unlock();

2) 0.054 us | mutex_lock();

2) 0.062 us | mutex_unlock();

2) 0.087 us | mutex_unlock();

2) 0.066 us | mutex_unlock();

Function filtering: wildcards and modules

[root@btt-rhel7 tracing]# echo mutex_* > set_ftrace_filter

[root@btt-rhel7 tracing]# cat set_ftrace_filter

mutex_spin_on_owner

mutex_unlock

mutex_lock

mutex_trylock

mutex_lock_interruptible

mutex_lock_killable

[root@btt-rhel7 tracing]# echo :mod:dm_mirror:* > set_ftrace_filter

[root@btt-rhel7 tracing]# head -10 set_ftrace_filter

mirror_iterate_devices [dm_mirror]

mirror_postsuspend [dm_mirror]

mirror_status [dm_mirror]

mirror_resume [dm_mirror]

fail_mirror [dm_mirror]

wakeup_mirrord [dm_mirror]

delayed_wake_fn [dm_mirror]

free_context [dm_mirror]

mirror_dtr [dm_mirror]

trigger_event [dm_mirror]

...

Function filtering: graph function

set_graph_function: the function graph tracer turns tracing on when the function is called, and off when it returns

[root@btt-rhel7 tracing]# echo ttwu_do_wakeup > set_graph_function

[root@btt-rhel7 tracing]# echo function_graph > current_tracer

[root@btt-rhel7 tracing]# echo 1 > tracing_on

[root@btt-rhel7 tracing]# head -20 trace

# tracer: function_graph

#

# CPU DURATION FUNCTION CALLS

# | | | | | | |

3) | ttwu_do_wakeup() {

3) | check_preempt_curr() {

3) 0.077 us | resched_task();

3) 0.619 us | }

3) 1.066 us | }

1) | ttwu_do_wakeup() {

1) | check_preempt_curr() {

1) | check_preempt_wakeup() {

1) 0.078 us | update_curr();

1) 0.076 us | wakeup_gran.isra.54();

1) 1.175 us | }

1) 1.679 us | }

1) 2.159 us | }

Filtering tracepoints

There’s no need to filter which tracepoint - you already filtered it by choosing :-)

But you can filter under which conditions a tracepoint is printed, based on its fields.

Tracepoints are more than just *printks*

They are structured information

Basic tracepoint’s filtering interface

Do you recall that events are classified by subsystem?

Event options are in the directory:

events/$SUBSYSTEM/$EVENT_NAME

e.g.: events/irq/irq_handler_entry

Inside each there are these files:

id: the ID of the event

enable: echo 1 to enable, 0 to disable

filter: get/set filter options

format: information about the data gathered by this tracepoint

Filtering tracepoints: without filter

[root@btt-rhel7 tracing]# cat available_events | grep irq:irq_

irq:irq_handler_exit

irq:irq_handler_entry

[root@btt-rhel7 tracing]# cat available_events | grep irq:irq_ > set_event

[root@btt-rhel7 tracing]# tail -10 trace

<idle>-0 [001] d.h. 1543.014323: irq_handler_entry: irq=43 name=virtio0-input.0

<idle>-0 [001] d.h. 1543.014328: irq_handler_exit: irq=43 ret=handled

<idle>-0 [001] d.h. 1543.015088: irq_handler_entry: irq=43 name=virtio0-input.0

<idle>-0 [001] d.h. 1543.015090: irq_handler_exit: irq=43 ret=handled

kworker/3:0-2299 [003] d.h. 1543.232015: irq_handler_entry: irq=14 name=ata_piix

kworker/3:0-2299 [003] d.h. 1543.232147: irq_handler_exit: irq=14 ret=handled

kworker/3:0-2299 [003] d.h. 1543.232158: irq_handler_entry: irq=14 name=ata_piix

kworker/3:0-2299 [003] d.h. 1543.232196: irq_handler_exit: irq=14 ret=handled

<idle>-0 [001] d.h. 1543.534487: irq_handler_entry: irq=43 name=virtio0-input.0

<idle>-0 [001] d.h. 1543.534492: irq_handler_exit: irq=43 ret=handled

Filtering tracepoints!

[root@btt-rhel7 tracing]# cd events/irq/irq_handler_entry/

[root@btt-rhel7 irq_handler_entry]# ls

enable filter format id

[root@btt-rhel7 irq_handler_entry]# cat format

name: irq_handler_entry

ID: 114

format:

field:unsigned short common_type; offset:0; size:2; signed:0;

field:unsigned char common_flags; offset:2; size:1; signed:0;

field:unsigned char common_preempt_count; offset:3; size:1; signed:0;

field:int common_pid; offset:4; size:4; signed:1;

field:int irq; offset:8; size:4; signed:1;

field:__data_loc char[] name; offset:12; size:4; signed:1;

print fmt: "irq=%d name=%s", REC->irq, __get_str(name)

[root@btt-rhel7 irq_handler_entry]# echo 'irq == 14' > filter

[root@btt-rhel7 irq_handler_entry]# cd ../irq_handler_exit/

[root@btt-rhel7 irq_handler_exit]# echo 'irq == 14' > filter

Filtered tracepoints!

[root@btt-rhel7 irq_handler_exit]# cd ../../../

[root@btt-rhel7 tracing]# tail -10 trace

kworker/3:0-2299 [003] d.h. 2305.087986: irq_handler_entry: irq=14 name=ata_piix

kworker/3:0-2299 [003] d.h. 2305.088010: irq_handler_exit: irq=14 ret=handled

kworker/3:0-2299 [003] d.h. 2307.135803: irq_handler_entry: irq=14 name=ata_piix

kworker/3:0-2299 [003] d.h. 2307.135852: irq_handler_exit: irq=14 ret=handled

kworker/3:0-2299 [003] d.h. 2307.135858: irq_handler_entry: irq=14 name=ata_piix

kworker/3:0-2299 [003] d.h. 2307.135873: irq_handler_exit: irq=14 ret=handled

kworker/3:0-2299 [003] d.h. 2309.183882: irq_handler_entry: irq=14 name=ata_piix

kworker/3:0-2299 [003] d.h. 2309.183966: irq_handler_exit: irq=14 ret=handled

kworker/3:0-2299 [003] d.h. 2309.183973: irq_handler_entry: irq=14 name=ata_piix

kworker/3:0-2299 [003] d.h. 2309.183998: irq_handler_exit: irq=14 ret=handled
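
Setting the same filter on several events gets repetitive, so it can be wrapped. A minimal sketch: the function names are hypothetical, and clearing a filter by writing `0` follows the kernel's event tracing documentation rather than anything shown in these slides.

```shell
# set_event_filter DIR SUBSYSTEM/EVENT EXPRESSION
# Write EXPRESSION to the event's filter file, then enable the event.
set_event_filter() {
    evdir="$1/events/$2"
    if [ ! -d "$evdir" ]; then
        echo "no such event: $2" >&2
        return 1
    fi
    printf '%s\n' "$3" > "$evdir/filter" || return 1
    echo 1 > "$evdir/enable"
}

# clear_event_filter DIR SUBSYSTEM/EVENT
# Assumption: writing 0 to the filter file removes the filter.
clear_event_filter() {
    echo 0 > "$1/events/$2/filter"
}
```

Usage: `set_event_filter /sys/kernel/debug/tracing irq/irq_handler_entry 'irq == 14'`.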

A more complex filter!

[root@btt-rhel7 tracing]# cd events/sched/sched_wakeup

[root@btt-rhel7 sched_wakeup]# cat format

name: sched_wakeup

ID: 311

format:

field:unsigned short common_type; offset:0; size:2; signed:0;

field:unsigned char common_flags; offset:2; size:1; signed:0;

field:unsigned char common_preempt_count; offset:3; size:1; signed:0;

field:int common_pid; offset:4; size:4; signed:1;

field:char comm[32]; offset:8; size:16; signed:1;

field:pid_t pid; offset:24; size:4; signed:1;

field:int prio; offset:28; size:4; signed:1;

field:int success; offset:32; size:4; signed:1;

field:int target_cpu; offset:36; size:4; signed:1;

print fmt: "comm=%s pid=%d prio=%d success=%d target_cpu=%03d",

REC->comm, REC->pid, REC->prio, REC->success, REC->target_cpu

[root@btt-rhel7 sched_wakeup]# echo "prio < 100" > filter

[root@btt-rhel7 sched_wakeup]# echo 1 > enable

Let’s put more fun on it!

[root@btt-rhel7 sched_wakeup]# cd ../sched_switch/

[root@btt-rhel7 sched_switch]# cat format

name: sched_switch

ID: 309

format:

field:unsigned short common_type; offset:0; size:2; signed:0;

field:unsigned char common_flags; offset:2; size:1; signed:0;

field:unsigned char common_preempt_count; offset:3; size:1; signed:0;

field:int common_pid; offset:4; size:4; signed:1;

field:char prev_comm[32]; offset:8; size:16; signed:1;

field:pid_t prev_pid; offset:24; size:4; signed:1;

field:int prev_prio; offset:28; size:4; signed:1;

field:long prev_state; offset:32; size:8; signed:1;

field:char next_comm[32]; offset:40; size:16; signed:1;

field:pid_t next_pid; offset:56; size:4; signed:1;

field:int next_prio; offset:60; size:4; signed:1;

print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==>

next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state & (1024-1) ? __print_flags(REC->prev_state &

(1024-1), "|", { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" },

{ 128, "K" }, { 256, "W" }, { 512, "P" }) : "R", REC->prev_state & 1024 ? "+" : "", REC->next_comm,

REC->next_pid, REC->next_prio

[root@btt-rhel7 sched_switch]# echo "(prev_state == 1 && prev_prio < 100) || next_prio < 100 " > filter

[root@btt-rhel7 sched_switch]# echo 1 > enable

Let’s put some fun into it!

[root@btt-rhel7 sched_switch]# cd ../../../

[root@btt-rhel7 tracing]# tail -9 trace

<idle>-0 [001] dNh. 6155.077138: sched_wakeup: comm=watchdog/1 pid=19 prio=0 success=1

target_cpu=001

<idle>-0 [001] d... 6155.077165: sched_switch: prev_comm=swapper/1 prev_pid=0

prev_prio=120 prev_state=R ==> next_comm=watchdog/1

next_pid=19 next_prio=0

watchdog/1-19 [001] d... 6155.077181: sched_switch: prev_comm=watchdog/1 prev_pid=19

prev_prio=0 prev_state=S ==>

next_comm=swapper/1 next_pid=0 next_prio=120

<idle>-0 [002] dNh. 6155.089144: sched_wakeup: comm=watchdog/2 pid=24 prio=0 success=1

target_cpu=002

<idle>-0 [002] d... 6155.089166: sched_switch: prev_comm=swapper/2 prev_pid=0

prev_prio=120 prev_state=R ==> next_comm=watchdog/2

next_pid=24 next_prio=0

watchdog/2-24 [002] d... 6155.089181: sched_switch: prev_comm=watchdog/2 prev_pid=24

prev_prio=0 prev_state=S ==> next_comm=swapper/2

next_pid=0 next_prio=120

<idle>-0 [003] dNh. 6155.101158: sched_wakeup: comm=watchdog/3 pid=29 prio=0 success=1

target_cpu=003

<idle>-0 [003] d... 6155.101176: sched_switch: prev_comm=swapper/3 prev_pid=0

prev_prio=120 prev_state=R ==> next_comm=watchdog/3

next_pid=29 next_prio=0

watchdog/3-29 [003] d... 6155.101189: sched_switch: prev_comm=watchdog/3 prev_pid=29

prev_prio=0 prev_state=S ==> next_comm=swapper/3

next_pid=0 next_prio=120

Ah! Per-CPU trace! And trace_pipe! And buffer size!

That is simple! And useful!

Each CPU has a directory in the per_cpu/ directory

For example, for CPU 2: per_cpu/cpu2/

Each CPU has its own trace at: per_cpu/cpuX/trace

Trace pipe: run cat per_cpu/cpuX/trace_pipe

It is also available for all CPUs

The size of the trace buffer is defined per CPU in the file buffer_size_kb
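
Splitting the trace per CPU is handy when attaching it to a case. The sketch below copies each CPU's buffer into its own file. The helper name `save_per_cpu_traces` and the output file naming are hypothetical; only the per_cpu/cpuX/trace layout comes from the slides.

```shell
# save_per_cpu_traces DIR OUTDIR
# Copy DIR/per_cpu/cpuX/trace to OUTDIR/cpuX.trace for every CPU.
save_per_cpu_traces() {
    dir=$1
    out=$2
    mkdir -p "$out" || return 1
    for cpudir in "$dir"/per_cpu/cpu*; do
        [ -d "$cpudir" ] || continue      # no per-CPU directories at all
        cpu=${cpudir##*/}                 # e.g. cpu2
        cat "$cpudir/trace" > "$out/$cpu.trace"
    done
}
```

Usage: `save_per_cpu_traces /sys/kernel/debug/tracing /tmp/traces`.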

Triggering

Ok, it is nice to filter, but sometimes we need more!

I want to start the trace after the occurrence of an event

and I want to stop the trace after another event happens!

or I want to enable an event after a function is called

or I want to get the stack trace when a tracepoint occurs

ok, let’s try it!

Triggering on function trace

The interface for triggering is the filter file: set_ftrace_filter

echo 'function:action:times' > set_ftrace_filter to set

echo '!function:action:times' > set_ftrace_filter to clear

Let’s start by turning the tracing on and off

Triggering trace on and off - from a function

[root@btt-rhel7 tracing]# echo 0 > tracing_on

[root@btt-rhel7 tracing]# echo irq_exit:traceoff:5 irq_enter:traceon:5 > set_ftrace_filter

[root@btt-rhel7 tracing]# echo function > current_tracer

[root@btt-rhel7 tracing]# cat trace

# tracer: function

#

# entries-in-buffer/entries-written: 70/70 #P:4

#

# _-----=> irqs-off

# / _----=> need-resched

# | / _---=> hardirq/softirq

# || / _--=> preempt-depth

# ||| / delay

# TASK-PID CPU# |||| TIMESTAMP FUNCTION

# | | | |||| | |

bash-2372 [001] d... 2344.896123: irq_enter <-smp_apic_timer_interrupt

bash-2372 [001] d... 2344.896124: rcu_irq_enter <-irq_enter

bash-2372 [001] d.h. 2344.896124: exit_idle <-smp_apic_timer_interrupt

bash-2372 [001] d.h. 2344.896124: local_apic_timer_interrupt <-smp_apic_timer_interrupt

bash-2372 [001] d.h. 2344.896125: hrtimer_interrupt <-local_apic_timer_interrupt

bash-2372 [001] d.h. 2344.896125: _raw_spin_lock <-hrtimer_interrupt

[...]

bash-2372 [001] d.h. 2344.896132: _raw_spin_unlock <-hrtimer_interrupt

bash-2372 [001] d.h. 2344.896132: tick_program_event <-hrtimer_interrupt

bash-2372 [001] d.h. 2344.896132: clockevents_program_event <-tick_program_event

bash-2372 [001] d.h. 2344.896132: ktime_get <-clockevents_program_event

bash-2372 [001] d.h. 2344.896132: lapic_next_deadline <-clockevents_program_event

Triggering events on and off - from a function

[root@btt-rhel7 tracing]# echo 'irq_exit:disable_event:sched:sched_wakeup' > set_ftrace_filter

[root@btt-rhel7 tracing]# echo 'irq_enter:enable_event:sched:sched_wakeup' > set_ftrace_filter

[root@btt-rhel7 tracing]# head -20 trace

# tracer: nop

#

# entries-in-buffer/entries-written: 467/15199 #P:4

#

# _-----=> irqs-off

# / _----=> need-resched

# | / _---=> hardirq/softirq

# || / _--=> preempt-depth

# ||| / delay

# TASK-PID CPU# |||| TIMESTAMP FUNCTION

# | | | |||| | |

<idle>-0 [003] dNh. 5605.176671: sched_wakeup:

comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003

<idle>-0 [003] dNh. 5605.996414: sched_wakeup:

comm=rcu_sched pid=13 prio=120 success=1 target_cpu=003

<idle>-0 [003] dNh. 5609.176670: sched_wakeup:

comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003

<idle>-0 [003] dNh. 5613.176671: sched_wakeup:

comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003

<idle>-0 [003] dNh. 5613.890256: sched_wakeup:

comm=rcu_sched pid=13 prio=120 success=1 target_cpu=003

<idle>-0 [003] dNh. 5615.996420: sched_wakeup:

comm=rcu_sched pid=13 prio=120 success=1 target_cpu=003

<idle>-0 [003] dNh. 5617.176668: sched_wakeup:

comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003

Triggering on events

Interface: the "trigger" file in the event directory

Format:

# echo '[!]command[:count] [if filter]' > trigger

Commands:

enable_event/disable_event

traceon/traceoff

snapshot

stacktrace

not available on RHEL7 :-( (yet?)

Triggering a stack trace - from an event

[root@kiron bristot]# cd /sys/kernel/debug/tracing/events/sched/sched_wakeup

[root@kiron sched_wakeup]# ls

enable filter format id trigger

[root@kiron sched_wakeup]# echo 'stacktrace:10 if prio < 100' > trigger

[root@kiron sched_wakeup]# cat ../../../trace

<idle>-0 [003] dNh. 7762.589836: <stack trace>

=> ftrace_raw_event_sched_wakeup_template

=> ttwu_do_wakeup

=> ttwu_do_activate.constprop.90

=> try_to_wake_up

=> wake_up_process

=> hrtimer_wakeup

=> __run_hrtimer

=> hrtimer_interrupt

=> local_apic_timer_interrupt

=> smp_apic_timer_interrupt

=> apic_timer_interrupt

=> cpuidle_enter

=> cpu_startup_entry

=> start_secondary

pulseaudio-2972 [003] dN.. 7762.590148: <stack trace>
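
The trigger string follows the `command[:count] [if filter]` format given earlier, so it can be assembled from its parts. A sketch under assumptions: both function names are hypothetical, and the kernel must provide the trigger file (the slides note that RHEL7 did not).

```shell
# event_trigger_spec COMMAND [COUNT] [FILTER]
# Print a trigger string in the form: command[:count] [if filter]
event_trigger_spec() {
    spec=$1
    [ -n "$2" ] && spec="$spec:$2"
    [ -n "$3" ] && spec="$spec if $3"
    printf '%s\n' "$spec"
}

# set_event_trigger DIR SUBSYSTEM/EVENT COMMAND [COUNT] [FILTER]
# Write the assembled string to the event's trigger file.
set_event_trigger() {
    trigger_file="$1/events/$2/trigger"
    shift 2
    event_trigger_spec "$@" > "$trigger_file"
}
```

Usage: `set_event_trigger /sys/kernel/debug/tracing sched/sched_wakeup stacktrace 10 'prio < 100'`.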

trace-cmd

A command line tool for ftrace

It is useful for collecting data on customers’ systems

If you know how to use ftrace, you know how to use trace-cmd

Tip: ftrace and vmcore

It is possible to extract the trace from a vmcore!

It helps to understand what happened before the crash

crash> extend /usr/lib64/crash/extensions/trace.so

crash> trace dump -t data.dat

crash> pwd

/cores/retrace/tasks/968181176/misc

crash> ls

bt-a bt-filter data.dat dwysocha-automated-analysis.txt

dwysocha-rhst-search-rip-string.txt retrace-log run_crash

sys sys-c

More info

LWN -> Kernel -> Kernel Index -> Kernel Tracing

Kernel Documentation: Documentation/trace/

ping bristot@sbr-kernel

Part I

Understanding the Linux kernel execution model

Operating system: What books say it is:

IMHO: Netherlands’ Flag!

Before starting...

Let’s redefine hardware

Another point of view of Hardware

Another point of view of Hardware

And we fit the kernel here:

And protect it

And the kernel runs...

How does the kernel run?

There are two ways to run the kernel’s code

Either by ‘calling the kernel’

Or by running a kernel thread

Calling the kernel

We can think of the kernel as a library of functions that are activated to serve an event

These events are generated either:

by the hardware, or

by the software.

How does the kernel receive these events?

Via interrupts

What is an interrupt?

Interrupts are events that indicate that a condition exists somewhere in the system, the processor, or within the currently executing program or task that requires the attention of a processor. They typically result in a forced transfer of execution from the currently running program or task to a special software routine or task called an interrupt handler - Intel 64 and IA-32 Architectures Software Developer’s Manual.

Types of interrupts

Hardware activated:

Asynchronous

Software activated:

Synchronous

Exceptions:

Faults: correctable; the offending instruction is retried

Traps: often for debugging; the instruction is not retried

Aborts: not correctable; severe errors!

Software interrupts:

System calls!

A hardware interrupt

Hardware Interrupt: Another point of view

How about processes?

What is a process?

A process is a virtual memory context

Running on a protected ring

Where the threads run

Process and Threads

A process is a ‘virtual’ environment where threads have their:

code

data

stack

resources: e.g. sockets, file descriptors, and so on.

And they run: threads are scheduled to run on a processor

There’s no ‘software layer between the thread and the processor’

Ok, that flag, I mean, diagram, fits Java :P

But sometimes a thread needs more resources...

These resources are managed by the kernel

So: threads run with operating system support!

Not ON the operating system.

Hey kernel! I need a resource!

How does a thread ask the kernel for a resource?

Hey kernel! I need a resource!

It runs the kernel :-)...

Threads running on kernel space

A thread can run kernel code in kernel-space

And we say that the kernel runs on behalf of the thread

Each thread has a stack in kernel context

How does a thread jump to ‘kernel context on ring 0’?

By generating a software interrupt o/

Thread running on kernel: system call

Thread running on kernel: or via exceptions

Another point of view:

Kernel threads

Are threads that run in the kernel address space.

They are like regular threads - but they only run in kernel space.

Finally, the kernel threads:

So, we have the following ways to run the kernel’s code

IRQ - hardware activated

Soft IRQ

Process threads:

Via system call

Via exceptions

Kernel threads:

Run only in kernel-space

That explains how! But not when!

How does the system decide to run an IRQ or a thread?

Hardware IRQs

They are asynchronous

The kernel can’t control when they will run

They start running: they are not scheduled to run!

But it can control whether they can be activated

Only maskable interrupts

They run until they finish: the kernel can’t put them to sleep:

But they can suffer interference from another IRQ

And they can block on spinlocks

IRQ running

Threads

They are activated in the kernel context

sched_wakeup

Because they go to sleep in the kernel context

Mainly via system call

Most common states

S - Interruptible

D - Uninterruptible

R is the Runnable state

But it does not mean that they are running!

Another thread can be running at the time

And the thread is waiting to be scheduled...

States of a thread

Thread sleeping/waking up

Schedulers

Real-time dynamic priority: DEADLINE

Each task has a:

Period

Execution time - or budget

Deadline

Closer deadline - higher priority

Real-time fixed priority: FIFO/RR

Each task has a fixed priority, out of 99 possible:

User-space: 1 (lowest) to 99 (highest)

Kernel-space: 98 (lowest) to 0 (highest)

The highest-priority thread runs

Tasks with the same priority:

FIFO: each task runs until it finishes

RR: tasks share CPU time in a round-robin fashion

Schedulers

Fair Scheduler: OTHER

Will provide the same amount of CPU time to each runnable task in a period.

A less nice task (lower nice value) will receive more time in a period.

This nice value is internally mapped to a priority

Kernel-space: 100 (highest) to 139 (lowest).

IDLE: Waits on kernel
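
The two priority scales explain the `prio` values seen in the traces, e.g. `prio=0` for the watchdog threads and `prio=120` for regular tasks. The sketch below encodes the usual mapping as an assumption (kernel priority = 99 - real-time priority, or 120 + nice); it is not taken from the slides, and the function name `kernel_prio` is hypothetical.

```shell
# kernel_prio POLICY VALUE
# POLICY is fifo, rr or other. VALUE is the real-time priority (1..99)
# for fifo/rr, or the nice value (-20..19) for other.
# Prints the priority as it appears in the prio= field of sched events.
kernel_prio() {
    case $1 in
        fifo|rr)
            [ "$2" -ge 1 ] && [ "$2" -le 99 ] || return 1
            echo $((99 - $2))
            ;;
        other)
            [ "$2" -ge -20 ] && [ "$2" -le 19 ] || return 1
            echo $((120 + $2))
            ;;
        *)
            return 1
            ;;
    esac
}
```

Under this mapping a filter such as `prio < 100` selects the real-time tasks, because every fair task lands at 100 or above.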

Scheduling

Is there a Deadline task ready?

Get the one with the closest deadline

Is there an RT task ready?

Get the one with the highest priority

Is there a Fair task ready?

Get the next to run in the fair fashion

Otherwise, enter the idle state.
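
The order of the questions above can be written down as a toy model. This is only an illustration of the decision order, not how the kernel implements it; the function name `pick_next_class` and its inputs (the number of ready tasks per class) are hypothetical.

```shell
# pick_next_class DEADLINE_READY RT_READY FAIR_READY
# Each argument is the number of ready tasks in that class.
# Prints the class the next task would come from.
pick_next_class() {
    if [ "$1" -gt 0 ]; then
        echo deadline      # closest deadline wins inside this class
    elif [ "$2" -gt 0 ]; then
        echo rt            # highest fixed priority wins inside this class
    elif [ "$3" -gt 0 ]; then
        echo fair          # next task in the fair fashion
    else
        echo idle
    fi
}
```

Usage: `pick_next_class 0 1 5` reports that a ready RT task runs before any of the five fair tasks.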

Scheduling

Conclusion

Applications do not run on the OS - they run on the hardware

The OS is responsible for providing the environment and resources

The kernel is activated by interrupts.

From hardware, and

From software.

Threads run in kernel-space

Threads sleep in kernel-space

and the kernel schedules the threads

The end.

Thanks for listening.
