
EPL 605: Advanced Computer Architecture

Lab No. 4

Linux Monitoring Utilities (perf, top, mpstat, ps, free), gdb disassembler, and gnuplot

Πέτρος Παναγή

Lecturer: Zacharias Hadjilambrou


top

Realtime monitoring of:

CPU and memory utilization for each process

Total CPU utilization – average and per core

Total Memory utilization (used and free)

Useful command switches:

top -d 1 # set the update interval to 1 second (the default is 3 seconds)

top -b # run in batch mode; top runs until killed, useful for saving its output to a file

top -H # show individual threads

top -u username # show only the processes of a specific user


top

• 103ws1:/home/research/zhadji01/EPL605labs>taskset -c 0 ./matrix_serial_ver1 &> /dev/null &

• 103ws1:/home/research/zhadji01/EPL605labs>top

• CPU utilization fields: us (user time), sy (system time), ni (time spent running niced user processes, i.e. processes whose priority was changed with nice), id (idle time), wa (time the CPU waits for I/O), hi and si (time spent handling hardware and software interrupts)

• User zhadji01 is running matrix_serial_v and xioann02 is running firefox; total main memory is 8 GB (7933440 KB)

• matrix_serial_v consumes 99.7% CPU time and 0.1% of total memory

• Average CPU utilization is 27.3% and the CPU has four cores, which means that roughly one core is fully utilized

• Press 1 to view CPU utilization per core

• Indeed, core 0 is 100% utilized: matrix_serial_v runs on core 0, as instructed by taskset
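The "roughly one core" reading of the average utilization can be checked with a short calculation. The sketch below is only an illustration: the helper name `busy_cores` is hypothetical, and it assumes the average reported by top is a plain mean over all cores.

```shell
#!/bin/sh
# busy_cores AVG_PERCENT NUM_CORES
# Converts an average CPU utilization (percent, averaged over all cores)
# into the equivalent number of fully busy cores.
# Hypothetical helper, not part of top.
busy_cores() {
  LC_ALL=C awk -v avg="$1" -v n="$2" 'BEGIN { printf "%.2f\n", avg * n / 100 }'
}

# Values from the slide: 27.3% average utilization on a four-core CPU.
busy_cores 27.3 4   # prints 1.09
```

A result slightly above 1 is consistent with one fully loaded core plus a small amount of background activity on the other cores.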


top -H to view threads of multithreaded programs

• 103ws1:/home/research/zhadji01/EPL605labs>./simpleParallelProgram &

• 103ws1:/home/research/zhadji01/EPL605labs>top

400% CPU utilization means the program uses 4 cores

Run top -H to view threads:

103ws1:/home/research/zhadji01/EPL605labs>top -H

Each thread has ~100% CPU utilization, meaning it fully utilizes one core


top useful keys in interactive mode

• Press 1 to view per-core utilization

• Press Shift+P to sort processes by CPU utilization, highest first

• Press d to change the update delay

• Press u to show only the processes of a specific user


htop

htop (https://linux.die.net/man/1/htop)

An interactive system monitor and process viewer

ps command

• Gives a snapshot of all processes

• 103ws1:/home/research/zhadji01/EPL605labs>ps aux

• 103ws1:/home/research/zhadji01/EPL605labs>ps -ef

• ps -eLF # information about threads


mpstat

A good tool to view CPU utilization

mpstat -P ALL 2 1

# -P ALL: report statistics for every core

# 2 1: an interval of 2 seconds and 1 report (mpstat takes the interval first, then the count)
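To make the utilization figures less of a black box, the sketch below shows how a busy percentage can be derived from two snapshots of the aggregate `cpu` line of /proc/stat, the kernel tick counters that tools of this kind read on Linux. The function name `cpu_busy_pct` and the sample lines are hypothetical, and only the first seven counters (user, nice, system, idle, iowait, irq, softirq) are considered, which is a simplification.

```shell
#!/bin/sh
# cpu_busy_pct "CPU_LINE_T0" "CPU_LINE_T1"
# Each argument is a "cpu ..." line in /proc/stat layout:
#   cpu user nice system idle iowait irq softirq
# Prints the percentage of time the CPU was not idle between the two snapshots.
# Hypothetical helper; steal and guest counters are ignored in this sketch.
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | LC_ALL=C awk '
    {
      total = 0
      for (f = 2; f <= 8; f++) total += $f   # sum of all seven counters
      idle = $5 + $6                         # idle + iowait ticks
      if (NR == 1) { total0 = total; idle0 = idle }
      else         { total1 = total; idle1 = idle }
    }
    END { printf "%.1f\n", 100 * (1 - (idle1 - idle0) / (total1 - total0)) }'
}

# Made-up snapshots: 1000 ticks elapsed, 750 of them idle.
cpu_busy_pct "cpu 100 0 50 850 0 0 0" "cpu 300 0 100 1600 0 0 0"   # prints 25.0
```

On a live system the two arguments could be obtained with `grep '^cpu ' /proc/stat`, taken a few seconds apart.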


free

Tool to view memory utilization

free -g # -g displays the amounts in gigabytes
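As a rough cross-check of the memory total quoted in the top example (7933440 KB), the conversion to gigabytes can be sketched as follows. The helper name `kb_to_gib` is hypothetical, and the sketch assumes binary units (1 GiB = 1024 * 1024 KiB).

```shell
#!/bin/sh
# kb_to_gib KILOBYTES
# Converts a size in KiB to GiB with two decimals.
# Hypothetical helper, assuming 1 GiB = 1048576 KiB.
kb_to_gib() {
  LC_ALL=C awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / (1024 * 1024) }'
}

kb_to_gib 7933440   # prints 7.57
```

The result is somewhat below 8 because part of the installed memory is not reported as usable by the operating system.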


perf: Linux profiling with performance counters

Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache misses suffered, or branches mispredicted.

perf provides rich, generalized abstractions over hardware-specific capabilities. Among others, it provides per-task, per-CPU and per-workload counters, sampling on top of these, and source code event annotation. perf shows you where the hotspots of your program are.

https://perf.wiki.kernel.org/index.php/Main_Page

Intel Core Performance Monitor Unit (PMU)

Limited number of hardware counters (8 counters per core in the example above)

Time multiplexing is performed when the number of selected events exceeds the number of hardware counters. In that case an estimate of the actual count is reported.

E.g., if the user wants to measure instructions and cycles but only one counter is available, perf measures instructions for half of the time and cycles for the other half. The measured instructions and cycles are then multiplied by 2 to estimate the actual total instructions and cycles.
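The multiply-by-2 rule is a special case of scaling the raw count by the ratio of the time the event was enabled to the time it actually occupied a counter. A minimal sketch of that arithmetic, with a hypothetical helper name `scale_count` and made-up numbers:

```shell
#!/bin/sh
# scale_count RAW_COUNT TIME_ENABLED TIME_RUNNING
# Estimates the full-run count of a multiplexed event:
#   estimate = raw * time_enabled / time_running
# Hypothetical helper; the times may be in any unit as long as both match.
scale_count() {
  LC_ALL=C awk -v raw="$1" -v en="$2" -v run="$3" \
    'BEGIN { printf "%.0f\n", raw * en / run }'
}

# An event that was counted for only half of a 1000 ms run.
scale_count 6000000000 1000 500   # prints 12000000000
```

The estimate is accurate only if the workload behaves similarly during the measured and unmeasured intervals.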


Perf Events

>perf list

List of pre-defined events (to be used in -e):

cpu-cycles OR cycles [Hardware event]

instructions [Hardware event]

cache-references [Hardware event]

cache-misses [Hardware event]

branch-instructions OR branches [Hardware event]

branch-misses [Hardware event]

bus-cycles [Hardware event]

stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]

stalled-cycles-backend OR idle-cycles-backend [Hardware event]

ref-cycles [Hardware event]

cpu-clock [Software event]

task-clock [Software event]

page-faults OR faults [Software event]

context-switches OR cs [Software event]

cpu-migrations OR migrations [Software event]

minor-faults [Software event]

major-faults [Software event]

alignment-faults [Software event]

emulation-faults [Software event]

L1-dcache-loads [Hardware cache event]

L1-dcache-load-misses [Hardware cache event]

L1-dcache-stores [Hardware cache event]

L1-dcache-store-misses [Hardware cache event]

L1-dcache-prefetches [Hardware cache event]

...

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html


https://perf.wiki.kernel.org/index.php/Tutorial



Measuring multiple events

perf stat -e instructions,cycles ./matrix_serial_ver1

Performance counter stats for './matrix_serial_ver1':

34,058,795,490 instructions # 2.67 insn per cycle

12,762,609,265 cycles

3.487285969 seconds time elapsed

To measure more than one event, provide a comma-separated list after -e:

perf stat -e cycles,instructions,cache-misses ./matrix_serial_ver1

To save the output to a file, use the -o switch:

perf stat -o tmp -e cycles,instructions,cache-misses ./matrix_serial_ver1

cat tmp



12,762,609,265 cycles
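The "insn per cycle" figure that perf prints is simply the instruction count divided by the cycle count. The sketch below recomputes it from the two counts shown above; the helper name `ipc` is hypothetical, and the comma stripping assumes the thousands separators that appear in this output.

```shell
#!/bin/sh
# ipc INSTRUCTIONS CYCLES
# Computes instructions per cycle from two perf stat counts.
# Commas used as thousands separators are removed first.
# Hypothetical helper, not part of perf.
ipc() {
  ipc_insn=$(printf '%s' "$1" | tr -d ',')
  ipc_cyc=$(printf '%s' "$2" | tr -d ',')
  LC_ALL=C awk -v i="$ipc_insn" -v c="$ipc_cyc" 'BEGIN { printf "%.2f\n", i / c }'
}

# Counts from the perf stat output of matrix_serial_ver1.
ipc "34,058,795,490" "12,762,609,265"   # prints 2.67
```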


Attach to an already running process

To attach to a running process use -p (use -t to attach to a thread)

./matrix_serial_ver1 &

perf stat -e instructions,cycles -p $! & # $! is the PID of the last launched background process

or do

perf stat -e instructions,cycles -p `pgrep matrix_serial` &

pkill -SIGINT perf # send an interrupt signal to perf to make it print the statistics



12,762,609,265 cycles


perf stat -e instructions,cycles -p $! sleep 1 # perf runs for only 1 second; in this case pkill -SIGINT is not required





Select event level (user, kernel etc)


System wide collection

xg3:/home/root_desktop>./run_NPB.sh ./NPB3.3/NPB3.3-OMP/bin/ sp.C.x 32 &> /dev/null &

## We started a multithreaded workload that uses 32 cores; each core runs at 3 GHz

## To sum the instructions and cycles executed by all cores:

perf stat -e instructions,cycles -a sleep 1

Performance counter stats for 'system wide':


95,655,967,640 cycles # each core at 3 GHz executes about 3 billion cycles per second; multiplied by 32 cores, that is ~96 billion


The 0.36 instructions per cycle (IPC) reported here is the ratio of the summed instructions to the summed cycles, so it is indicative of single-thread performance

To measure the actual IPC of all cores do

perf stat -e instructions,cycles -A -C 0-31 sleep 1

-A disables statistics aggregation; -C selects the cores whose statistics are printed


Per core stats

The actual CPU IPC is ~0.33 * 32 = 10.56

Per-core and system-wide collection requires root access, or an administrator who allows global perf collection (set /proc/sys/kernel/perf_event_paranoid to -1)
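The two back-of-the-envelope figures on this slide, the expected system-wide cycle count and the whole-CPU IPC, can be reproduced as follows. Both helper names are hypothetical, and the sketch assumes every core runs at the same fixed frequency and achieves roughly the same per-core IPC.

```shell
#!/bin/sh
# expected_cycles CORES FREQ_HZ SECONDS
# Cycles that CORES cores running at FREQ_HZ execute in SECONDS seconds.
# Hypothetical helper, assuming a fixed, identical clock on every core.
expected_cycles() {
  LC_ALL=C awk -v n="$1" -v f="$2" -v s="$3" 'BEGIN { printf "%.0f\n", n * f * s }'
}

# aggregate_ipc PER_CORE_IPC CORES
# Whole-CPU instructions per cycle when every core achieves PER_CORE_IPC.
# Hypothetical helper.
aggregate_ipc() {
  LC_ALL=C awk -v ipc="$1" -v n="$2" 'BEGIN { printf "%.2f\n", ipc * n }'
}

expected_cycles 32 3000000000 1   # prints 96000000000
aggregate_ipc 0.33 32             # prints 10.56
```

The measured 95,655,967,640 cycles is close to the 96 billion estimate, which suggests that all 32 cores were busy for nearly the whole second.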

Perf top


Shows the hottest functions in real time


Matrix Multiplication Examples



gcc Optimizations and Branch Prediction



https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html


Perf annotate



>perf annotate


Assembly

>gcc -S main.c

>vi main.s

>gcc main.s -o main.s.out


>objdump -d main.out

>hexdump -C main.out


gdb

gcc -ggdb main.c -o main.out

bash

gdb main.out

b main (breakpoint at main)

r (run)

disassemble

s (step)

s

bt (backtrace)

set disassemble-next-line on


gdb Disassemble next line

setenv SHELL /bin/bash

gdb a.out

GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)

..

(gdb) b main

(gdb) set disassemble-next-line on

(gdb) r

Starting program: /home/faculty/petrosp/EPL370/FALL2012/LABS/EPL370Lab5/a.out

Breakpoint 1, main () at gprof-prog1.c:49

49 init( x );

=> 0x00000000004005bc <main+11>: 48 8d 85 60 fe ff ff lea -0x1a0(%rbp),%rax

0x00000000004005c3 <main+18>: 48 89 c7 mov %rax,%rdi

0x00000000004005c6 <main+21>: e8 f9 fe ff ff callq 0x4004c4 <init>

Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.4.x86_64

(gdb) s

init (x=0x7fffffffdf90) at gprof-prog1.c:8

8 for ( i = 0; i < 100; i++ ) {

=> 0x00000000004004cc <init+8>: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)

0x00000000004004d3 <init+15>: eb 16 jmp 0x4004eb <init+39>

(gdb)



gcc -ggdb main.c -o main.out

bash

gdb main.out

b main

r

layout asm

ctrl-x 2

s


GNUPlot

http://www.gnuplot.info/

http://www.gnuplot.info/documentation.html

(log in to cs6472 or any other machine that has gnuplot, and run gnuplot)

plot "data.txt" using 1:2 title 'Column 2', "data.txt" using 1:3 title 'Column 3'

plot "data.txt" using 1:2 title 'Column 2'

gnuplot> set term png (will produce .png output)

gnuplot> set output "printme.png" (output to any filename you use)

gnuplot> replot
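The plot commands above expect a whitespace-separated file whose first column is the x value and whose second and third columns are the series to plot. The slides do not show the contents of data.txt, so the sketch below only illustrates the expected layout with made-up values (x, x squared, x cubed); the helper name `make_data` is hypothetical.

```shell
#!/bin/sh
# make_data N
# Prints N rows with three whitespace-separated columns: x, x^2, x^3.
# Hypothetical generator for a file in the layout the plot commands expect;
# the values are made up and are not measurements from the lab.
make_data() {
  LC_ALL=C awk -v n="$1" \
    'BEGIN { for (x = 1; x <= n; x++) printf "%d %d %d\n", x, x * x, x * x * x }'
}

# Show the first rows; redirect to a file to use them with gnuplot.
make_data 3
```

Running `make_data 10 > data.txt` before starting gnuplot would produce a file that both plot commands can read. In a real lab the columns would instead hold measured values, for example a problem size followed by two perf counter readings.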



