https://distro-recipes.org
Distro Recipes 4th April 2013 @Paris
Credit : fras1977@flickr
My ${favorite} Linux Distribution is slow !
Performance does matter
● Users expect more performance
● They do have perfect hardware
● They installed the latest OS release
● So it should be faster than ever, shouldn't it ?
● But we still get those imprecise reports...
« Hey ! My Linux Distro is Slow ! »
« The latest OS reduces the performance ! »
About this talk
● What to expect ?
– Tricks to prove the distro is not always the bad guy
– A compilation of real debugging sessions
● What not to expect ?
– Having one magic answer about perf.
● Who are you ?
Tracking the beast
● Slowdowns come from various sources
– CPU
– Storage
– Interrupts
– Memory
– Network (not included in this presentation)
– Applications (not included)
CPU load
● Estimating the load of the CPU is pretty easy
● Using « top » with a sort on « cpu load »
– Don't mix it up with loadavg !
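A minimal way to get that view, assuming a recent procps-ng « top » : interactively, press P to sort by %CPU ; or in batch mode :

$ top -b -n 1 -o %CPU | head -15   # one snapshot, processes sorted by CPU usage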
Weird CPU issues
● Temperature
– Internal throttling to avoid overheat
– ~110/120°C on Intel CPUs
– Monitoring via coretemp & acpi
« CPU1: Core temperature above threshold, cpu clock throttled (total events = 12841) »
– Generates Machine Check Exceptions (MCE)
– As a result, CPU performance is reduced
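A quick way to check those readings, assuming the lm-sensors package and the coretemp driver are installed :

$ sensors | grep -i core        # per-core temperatures from coretemp
$ dmesg | grep -i throttled     # kernel messages about clock throttling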
Storage Load
● Massive IOs can slow down a system seriously
– Depending on the storage device (HDD vs SSD)
– Depending on the IO profile (sequential vs random)
– « vmstat » is useful to track this behavior
bi = blocks in
bo = blocks out
wa = waiting IO
si = swap in
so = swap out
Someone reads a lot !
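A sketch of how such a trace is captured (columns as in the legend above) :

$ vmstat 1    # one report per second ; watch the bi/bo and wa columns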
Storage Load
Someone tries to read a lot ! (3 threads reading 4K random)
● The CPU is waiting for the storage device (~30% wa)
● HDD + 3 threads @ 4K random generates a massive device load
● During this load, my system was unusable
● A desktop search, rsync, tar, ... can generate such a load
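A hedged sketch to reproduce this exact workload with fio (introduced later in this deck) ; the device name is hypothetical, and the job only reads :

$ fio --name=randread --filename=/dev/sdX --rw=randread --bs=4k \
      --numjobs=3 --direct=1 --runtime=60 --ioengine=libaio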
Storage Load
● A broken/slow storage device can load the system
● HDD : Broken sector reallocations are invisible but cause lags
● SATA disks try several times to recover sectors
● No other IOs will be accepted during this process
● This kills RAID arrays
● Enterprise-class SATA disks reallocate immediately
● Use SMART to count {broken|pending|reallocated} sectors
● %wa in top or vmstat will be high in such a case
Storage Load
● « smartctl -a /dev/sda » of a dying HDD disk
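A minimal check of the relevant attributes (attribute names vary per vendor, these are the common ones) :

$ smartctl -A /dev/sda | grep -i -e reallocated -e pending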
Storage Load
● SSDs : Far from a perfect device
● Performance may vary depending on the firmware implementation
● SLC front cache before reaching the MLC storage
– Getting an out-of-cache effect
– 200+MB/s on SLC
– 5MB/s on MLC in the worst case
– After a while, global SSD performance is limited to 5MB/sec
– Behavior not visible for {simple|short} workloads
– %wa in top or vmstat will increase in such a case
– Can be reproduced by using fio
http://git.kernel.dk/?p=fio.git
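A sketch of a fio job that writes past the SLC cache ; the file path and the 32g size are assumptions, pick something larger than the cache :

$ fio --name=slc-exhaust --filename=/mnt/ssd/fio.tmp --size=32g \
      --rw=write --bs=1M --direct=1 --ioengine=libaio --iodepth=8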
SSD IO Path
(diagram) SATA IOs @ 6Gb/sec → IO Controller → SLC Cache @ 960Mb/sec → MLC Cells @ 40Mb/sec
Weird Storage Issues
● Temperature
– On HDDs, thermal recalibration can occur too often to maintain a consistent level of service
– Media-class disks are less subject to this effect
● Vibrations
– RAID arrays contain several HDDs spinning constantly
– All these individual vibrations prevent heads from staying properly aligned, leading to head recalibrations
– This can totally prevent a RAID array from delivering IOs
IRQ Storms
● Inside an array of 1200+ identical computers
● Some boot very, very slowly and trigger software watchdogs
● /proc/interrupts reports an IRQ storm (66000 per sec) on interrupt 19
● The CPU is permanently interrupted by IRQs
● The AHCI controller floods interrupts as the HDD doesn't answer ATA_IDENTIFY requests (seen by extracting the HDD)
● The AHCI driver fails at probing, so interrupt 19 only reports the USB device
● Some hardware failures can lead to load issues
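A simple way to watch such a storm live ('-d' highlights the counters that change between refreshes) :

$ watch -n 1 -d 'grep " 19:" /proc/interrupts'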
IRQ Storms
Memory Issues
● 2 identical servers that don't perform the same
– One is really slower than the other
● Same server brand / model
● Same vendor
● Same hardware setup
● But they really perform differently...
● What the hell is my {application|os} doing wrong here ?
Memory Issues
● Memory banks were not populated with the same HW
● Some were DDR3 with a CAS Latency = 9
● Some were DDR3 with a CAS Latency = 11
● As a result, memory accesses were slower on one server
● This got detected at runtime under Linux with DDR3 timing tool from Cyring. (http://code.cyring.fr/FTS/?PATH=Source/C/DDR3_Timings/0.2/timings.c)
● Hardware setups were supposed to be the same !
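A hedged first check that the DIMM populations really match (needs root ; for the CAS timings themselves, use an SPD tool such as the one linked above) :

$ dmidecode -t memory | grep -e 'Speed' -e 'Part Number'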
Dear Loadavg,
● You are complicated to understand
● You don't help in tracking the source of the load
● You can be a liar if some kernel code doesn't update you
● But you provide an indicator on the global load
– 1.0 means 100% of the resources
● I'll keep you as a raw indicator to start my investigations
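And that raw indicator is one read away :

$ cat /proc/loadavg   # 1, 5 and 15-minute averages, runnable/total tasks, last PID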