https://distro-recipes.org
Distro Recipes 4th April 2013 @Paris
Credit : fras1977@flickr
My ${favorite} Linux Distribution is slow !
Performance does matter
● Users expect more performance
● They do have perfect hardware
● They installed the latest OS release
● So it should be faster than ever, shouldn't it ?
● But we still get those imprecise reports...
« Hey ! My Linux Distro is Slow ! »
« The latest OS reduces the performance ! »
About this talk
● What to expect ?
– Tricks to prove the distro is not always the bad guy
– A compilation of real debugging sessions
● What not to expect ?
– Having one magic answer about perf.
● Who are you ?
Tracking the beast
● Slowdowns come from various sources
– CPU
– Storage
– Interrupts
– Memory
– Network (not included in this presentation)
– Applications (not included)
CPU load
● Estimating the load of the CPU is pretty easy
● Using « top » with a sort on « cpu load »
– Don't mix it up with loadavg !
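A minimal way to get that view, assuming a recent procps-ng « top » : interactively, press P to sort by %CPU ; or in batch mode :

$ top -b -n 1 -o %CPU | head -15   # one snapshot, processes sorted by CPU usage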
Weird CPU issues
● Temperature
– Internal throttling to avoid overheat
– ~110/120°C on Intel CPUs
– Monitoring via coretemp & acpi
« CPU1: Core temperature above threshold, cpu clock throttled (total events = 12841) »
– Generates Machine Check Exceptions (MCE)
– As a result, CPU performance is reduced
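A quick way to check those readings, assuming the lm-sensors package and the coretemp driver are installed :

$ sensors | grep -i core        # per-core temperatures from coretemp
$ dmesg | grep -i throttled     # kernel messages about clock throttling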
Storage Load
● Massive IOs can slow down a system seriously
– Depending on the storage device (HDD vs SSD)
– Depending on the IO profile (sequential vs random)
– « vmstat » is useful to track this behavior
bi = blocks in
bo = blocks out
wa = waiting IO
si = swap in
so = swap out
Someone reads a lot !
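A sketch of how such a trace is captured (columns as in the legend above) :

$ vmstat 1    # one report per second ; watch the bi/bo and wa columns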
Storage Load
Someone tries to read a lot ! (3 threads reading 4K random)
● The CPU is waiting for the storage device (~30% wa)
● HDD + 3 threads @ 4K random generates a massive device load
● During this load, my system was unusable
● A desktop search, rsync, tar, ... can generate such a load
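A hedged sketch to reproduce this exact workload with fio (introduced later in this deck) ; the device name is hypothetical, and the job only reads :

$ fio --name=randread --filename=/dev/sdX --rw=randread --bs=4k \
      --numjobs=3 --direct=1 --runtime=60 --ioengine=libaio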
Storage Load
● A broken/slow storage device can load the system
● HDD : Broken sector reallocations are invisible but cause lags
● SATA disks try several times to recover sectors
● No other IOs will be accepted during this process
● This kills RAID arrays
● Enterprise-class SATA disks reallocate immediately
● Use SMART to count {broken|pending|reallocated} sectors
● %wa in top or vmstat will be high in such a case
Storage Load
● « smartctl -a /dev/sda » of a dying HDD disk
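A minimal check of the relevant attributes (attribute names vary per vendor, these are the common ones) :

$ smartctl -A /dev/sda | grep -i -e reallocated -e pending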
Storage Load
● SSDs : Far from a perfect device
● Performance may vary depending on the firmware implementation
● SLC front cache before reaching the MLC storage
– Getting an out-of-cache effect
– 200+MB/s on SLC
– 5MB/s on MLC in the worst case
– After a while, global SSD performance is limited to 5MB/sec
– Behavior not visible for {simple|short} workloads
– %wa in top or vmstat will increase in such a case
– Can be reproduced by using fio
http://git.kernel.dk/?p=fio.git
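A sketch of a fio job that writes past the SLC cache ; the file path and the 32g size are assumptions, pick something larger than the cache :

$ fio --name=slc-exhaust --filename=/mnt/ssd/fio.tmp --size=32g \
      --rw=write --bs=1M --direct=1 --ioengine=libaio --iodepth=8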
SSD IO Path
(diagram) SATA IOs @ 6Gb/sec → IO Controller → SLC Cache @ 960Mb/sec → MLC Cells @ 40Mb/sec
Weird Storage Issues
● Temperature
– On HDDs, thermal recalibration can occur too often to maintain a consistent level of service
– Media-class disks are less subject to this effect
● Vibrations
– RAID arrays contain several HDDs spinning constantly
– All these individual vibrations prevent heads from staying properly aligned, leading to head recalibrations
– This can totally prevent a RAID array from delivering IOs
IRQ Storms
● Inside an array of 1200+ identical computers
● Some boot very, very slowly and trigger software watchdogs
● /proc/interrupts reports an IRQ storm (66000 per sec) on interrupt 19
● The CPU is permanently interrupted by IRQs
● The AHCI controller floods interrupts as the HDD doesn't answer ATA_IDENTIFY requests (seen by extracting the HDD)
● The AHCI driver fails at probing, so interrupt 19 only reports the USB device
● Some hardware failures can lead to load issues
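A simple way to watch such a storm live ('-d' highlights the counters that change between refreshes) :

$ watch -n 1 -d 'grep " 19:" /proc/interrupts'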
IRQ Storms
Memory Issues
● 2 identical servers that don't perform the same
– One is really slower than the other
● Same server brand / model
● Same vendor
● Same hardware setup
● But they really perform differently...
● What the hell is my {application|os} doing wrong here ?
Memory Issues
● Memory banks were not populated with the same HW
● Some were DDR3 with a CAS Latency = 9
● Some were DDR3 with a CAS Latency = 11
● As a result, memory accesses were slower on one server
● This got detected at runtime under Linux with DDR3 timing tool from Cyring. (http://code.cyring.fr/FTS/?PATH=Source/C/DDR3_Timings/0.2/timings.c)
● Hardware setups were supposed to be the same !
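A hedged first check that the DIMM populations really match (needs root ; for the CAS timings themselves, use an SPD tool such as the one linked above) :

$ dmidecode -t memory | grep -e 'Speed' -e 'Part Number'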
Dear Loadavg,
● You are complicated to understand
● You don't help in tracking the source of the load
● You can be a liar if some kernel code doesn't update you
● But you provide an indicator on the global load
– 1.0 means 100% of the resources
● I'll keep you as a raw indicator to start my investigations
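And that raw indicator is one read away :

$ cat /proc/loadavg   # 1, 5 and 15-minute averages, runnable/total tasks, last PID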