39
HATAYAMA, Daisuke FUJITSU LIMITED. LinuxCon Japan 2013 Kdump: towards fast and stable crash dumping on terabyte- scale memory system Copyright 2013 FUJITSU LIMITED

Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

HATAYAMA, Daisuke

FUJITSU LIMITED.

LinuxCon Japan 2013

Kdump: towards fast and stable

crash dumping on terabyte-

scale memory system

Copyright 2013 FUJITSU LIMITED

Page 2: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Terabyte-scale memory system

Common on enterprise-class model line-up

Use cases

In-memory database

VM consolidation

HPC

Copyright 2013 FUJITSU LIMITED

Company Model Max Memory Size

SGI

SGI UV 2000 64TiB

HP HP ProLiant DL980 G7 4TiB

IBM System x3950 X5

3TiB

Fujitsu PRIMEQUEST 1800E 2TiB

1

Page 3: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Kdump issues on

terabyte-scale memory system

Collecting crash dump takes several hours without special handling

Dump filtering

• reduce size of crash dump – create “mini dump”

Design and implementation had no longer scaled

Even dump filtering could take tens of minutes

Crash dump could fail due to too much memory consumption

Still, full dump is necessary for Mission Critical use

Copyright 2013 FUJITSU LIMITED 2

Page 4: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

What this talk explains

How several issues on creating “mini dump” were resolved since last year

For most users who think “mini dump” is enough and they don’t require rigorous failure analysis

How can we collect “full dump” in a reasonable time and what challenges still there are

For users who think “full dump” is essential and require rigorous failure analysis on mission critical use

Copyright 2013 FUJITSU LIMITED 3

Page 5: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Agenda

Review kdump framework

Recently fixed issues for “mini dump”

Memory consumption issue

Dump filtering performance degradation issue

Ongoing and future work for “full dump”

Towards fast crash dumping for “full dump”

Copyright 2013 FUJITSU LIMITED 4

Page 6: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Copyright 2013 FUJITSU LIMITED

Review kdump framework

5

Page 7: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Kdump: kexec-based crash dumping

Linux kernel standard crash dump feature

Merged at 2.6.13

Components related to crash dumping

Capture kerenel

makedumpfile

/proc/vmcore

Copyright 2013 FUJITSU LIMITED

System kernel (1st kernel)

capture kernel (2nd kernel

)

kexec

memory

/proc/vmcore read

cp or makedumpfile

crash dump

write

disk

6

Page 8: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Kdump: kexec-based crash dumping

Linux kernel standard crash dump feature

Merged at 2.6.13

Components related to crash dumping

Capture kerenel

makedumpfile

/proc/vmcore

Copyright 2013 FUJITSU LIMITED

System kernel (1st kernel)

capture kernel (2nd kernel

)

kexec

memory

/proc/vmcore read

cp or makedumpfile

crash dump

write

disk

7

Page 9: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Capture kernel

Kernel booted from system kernel

Running on the memory reserved at system kernel

crashkernel=<memory size>

•128MiB ~ 512MiB

Has /proc/vmcore

Copyright 2013 FUJITSU LIMITED

System kernel (1st kernel)

capture kernel (2nd kernel

)

kexec

memory

/proc/vmcore read

cp or makedumpfile

crash dump

write

disk

8

Page 10: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

/proc/vmcore

File interface for capture kernel to access system kernel as ELF binary format

Use user-space tools for copying crash dump

• Make some additional processing if necessary

cp /proc/vmcore /mnt/dump

Copyright 2013 FUJITSU LIMITED

System kernel (1st kernel)

capture kernel (2nd kernel

)

kexec

memory

/proc/vmcore read

cp or makedumpfile

crash dump

write

disk

9

Page 11: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

System kernel (1st kernel)

capture kernel (2nd kernel

)

kexec

memory

/proc/vmcore read

cp or makedumpfile

crash dump

write

disk

makedumpfile

Copy vmcore with some additioanl processing

Compression

Dump filtering --- create mini dump

•Free pages

•Page cache

•Page cache (meta data)

•User data

•Zero page

Copyright 2013 FUJITSU LIMITED 10

Page 12: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Copyright 2013 FUJITSU LIMITED

Recently fixed issues for “mini dump”

Memory consumption issue

Dump filtering performance degradation issue

11

Page 13: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Copyright 2013 FUJITSU LIMITED

Recently fixed issues for “mini dump”

Memory consumption issue

Dump filtering performance degradation issue

12

Page 14: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Capture kernel has severe memory limitation

Reserved memory: 512 MiB

Cannot assume other disks

Continue working in ramdisk

• System kernel’s rootfs can be broken

On network configuration dump device is not connected

• scp /proc/vmcore [email protected]:/mnt/dump

important to use memory efficiently

Copyright 2013 FUJITSU LIMITED 13

Page 15: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Memory consumption issue

Problem

Makdumpfile had consumed much memory on terabyte-scale memory system

Killed by OOM killer, failing to collect crash dump

Cause Makedumpfile’s dump format has two bitmaps that grow in proportion to

system memory size

• 1TiB memory32 MiB bitmap

Two bitmaps are allocated at a time on memory

• On 8TiB, bitmap size exceeds size of reserved memory

• 2 * 32 * 8 = 512MiB

Copyright 2013 FUJITSU LIMITED

Disk dump header

Kdump sub header

Bitmap 1

Bitmap 2

Page descripter table

Page data

14

Page 16: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Solution: divide process into cycles

If N-cycles, bitmap size on memory is reduced to 1/N.

Repeat the following processing N-cycles:

• Create bitmap

• Write the corresponding pages in crash dump

Copyright 2013 FUJITSU LIMITED

bitmap

memory

Crash dump

bitmap

memory

Crash dump

Normal mode Cyclic mode

1st cycle

2nd cycle

15

Page 17: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

free_list cannot be divided into cycles

All the processing in makedumpfile needs to be divided into cycles

Unfortunately, current method of filtering free pages cannot be divided into cycles

Free_list is not sorted: linear search costs O(memory size ^2)

Copyright 2013 FUJITSU LIMITED

1st 2nd 3rd 4th 5th 6th 7th

zone

free_list

page

memory

free page

16

Page 18: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Look up mem_map for free page filtering

Like other memory types, looking up mem_map array (page descriptors) in case of free pages

Each page descriptor is sorted w.r.t. page frame numbers

Copyright 2013 FUJITSU LIMITED

memory

mem_map

page[]

PG_buddy flag

1st 2nd 3rd 4th 5th 6th 7th

17

Page 19: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Further merits of mem_map array logic

More robust against data corruption

If free_list is corrupted, cannot follow anymore.

Infinite loop at worst case.

Copyright 2013 FUJITSU LIMITED 18

Page 20: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Supported versions and usage

Supported versions

v1.5.0: Cyclic-mode, by Atsushi Kumagai

•default mode

v1.5.1: New free pages filtering logic, by HATAYAMA

Usage

--cyclic-buffer <memory in kilobytes>

•80% of free memory is specified at default

--non-cyclic

•Switch back to the original mode

Copyright 2013 FUJITSU LIMITED 19

Page 21: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Support for above-4GiB bzImage load

Yinghai Lu at kernel v3.9

There is no longer 896MiB limit

Kernel parameter

crashkernel=size[KMG],high

crashkernel=size[KMG],low

Each component needs to support new bzImage protocol

System kernel: v3.9~

Kexec: v2.0.4~

Capture kernel: v3.9~

Copyright 2013 FUJITSU LIMITED 20

Page 22: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Copyright 2013 FUJITSU LIMITED

Recently fixed issues for “mini dump”

Memory consumption issue

Dump filtering performance degradation issue

21

Page 23: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Dump filtering performance degradation issue

Dump filtering time becomes no longer ignorable on terabyte-scale memory

27% of Total is used for dump filtering

Causes are in

makedumpfile

/proc/vmcore

Copyright 2013 FUJITSU LIMITED

makedumpfile kernel Dump filter (s) Copy data (s) Total (s)

1.5.1 3.7.0 470.14 1269.25 1739.39

Memory: 2TiB

22

Page 24: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

makedumpfile: paging was bottleneck

Problem: on 4-level paging, reads /proc/vmcore 5 times

4-level page walk + target memory access

mem_map is mapped in VMEMMAP area, not direct mapping area

Solution: add cache for /proc/vmcore By Petr Tesarik

Keep previously read 8 pages

• Reduce the number of reads to /proc/vmcore for page table access

Simple but very effective

Copyright 2013 FUJITSU LIMITED

mem_map

VMEMMAP_START

23

Page 25: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

/proc/vmcore: ioremap/iounmap was bottleneck

Problem:

ioremap/iounmap is called per a single page

TLB flush is called on each iounmap call

Solution: Support mmap() on /proc/vmcore

Reduce the number of TLB flush by large mapping size

• Both mem_map for dump filtering and page frames are consecutive data

No copy from kernel-space to user-space

Copyright 2013 FUJITSU LIMITED 24

Page 26: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Filtering Benchmark (2TB, 1CPU) [1/2]

Smaller is better

Copyright 2013 FUJITSU LIMITED

Contributed by Jingbai Ma (https://lkml.org/lkml/2013/3/27/19)

1.5.1 1.5.3 1.5.3 1.5.3+mmap 1.5.3+mmap 1.5.3+mmap 1.5.3+mmap 1.5.3+mmap

Copy data (s) 1269.25 992.77 892.27 606.06 531.77 435.4 435.49 434.64

Filter pages (s) 470.14 190.05 98.885 82.44 44.3 41.83 41.72 41.92

0

200

400

600

800

1000

1200

1400

1600

1800

2000

tim

e [

seco

nd

s]

Copy data (s)

Filter pages (s)

25

Page 27: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Filtering Benchmark (2TB, 1CPU) [2/2]

Two experimental kernel-side filtering work

Cliff’s 240 seconds on 8TiB memory system http://lists.infradead.org/pipermail/kexec/2012-November/007177.html

Ma’s 17.50 seconds on 1TB memory system https://lkml.org/lkml/2013/3/7/275

Rough comparison

Copyright 2013 FUJITSU LIMITED

Filter pages [sec/TiB]

Cliff 30 Kernel-side at capture kernel

Ma 18 Kernel-side at system kernel

This work 21 User-side at capture kernel

26

Page 28: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Community development status

Kernel

Mmap support on /proc/vmcore, v3.10-rc2 mmotm by HATAYAMA

makedumpfile

Cache for /proc/vmcore, v1.5.2, by Petr Tesarik

Mmap support, v1.5.3-devel, by Atsushi Kumagai

Copyright 2013 FUJITSU LIMITED 27

Page 29: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Copyright 2013 FUJITSU LIMITED

Ongoing and future work for “full dump”

28

Page 30: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Importance of full crash dump

Bug on modern system is complicated

Failure analysis by experienced developers with full dump is a must

dump filtering could lost important data

E.g. qemu/KVM system

Communication between guests and host

Virtualization assist features

•Transparent hugepage, compaction

full dump including qemu guests' images view from guests helps failure analysis

Copyright 2013 FUJITSU LIMITED 29

Page 31: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Toward fast crash dumping

Goal: Complete 2TiB full crash dump within an hour

How:

Fast compression algorithm

• Fast enough even if data is not reduced at all

Data striping like RAID0

• Write data concurrently into multiple disks

Problem

On x86 capture kernel cannot use multiple CPUs now

• Limited to 1 at default

• maxcpus=1

Copyright 2013 FUJITSU LIMITED 30

Page 32: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

x86 boot protocol issue

If BSP receives INIT IPI, then BSP jumps at BIOS init code

System hang or system reset

Copyright 2013 FUJITSU LIMITED

BSP

AP

Crash!

AP cpu0

cpu1

cpu0

BIO

S IN

IT C

OD

E

memory

System Kernel Capture Kernel

BSP

31

Page 33: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Experimental benchmark

Purpose: see performance under the configuration with

• fast compression: LZO, snappy

• paralell I/O

• multiple CPUs

How

Process 10GiB data in system kernel

The data is randomized not to be reduced by compression

• To see if fast compression can be used for free even at worst case

Copyright 2013 FUJITSU LIMITED

OS RHEL 6.3

Kernel 2.6.32-279.el6.x86_64

CPU Intel® Xeon® CPU [email protected]

System PRIMEQUEST1800E

Disk (147GiB, 6 Gb/s) x 12

32

Page 34: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Benchmark result (1 CPU)

Larger is better

Copyright 2013 FUJITSU LIMITED

2 [TiB / h]

33

Page 35: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Benchmark result (12 CPUS)

Larger is better

Copyright 2013 FUJITSU LIMITED

2 [TiB / h]

34

Page 36: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Community development status

Enter capture kernel with BSP by switching IPI NMI Nacked Depending IPI NMI reduces reliability

Copyright 2013 FUJITSU LIMITED

AP

BSP

Crash!

IPI NMI BSP

BSP

cpu0

AP

cpu0

System kernel Unset BSP flag bit in MSR reg

using wrmsr instruction in System Kernel

Some say there should be firmware assuming BSP flag bit to be kept throughout runtime

Disable BSP at Capture Kernel

Cannot execute rdmsr to read BSP flag bit in MSR for the CPU not running

• Chicken-and-egg problem

Look up APIC/MP table to check if given CPU is BSP

AP

BSP

Crash!

AP ACPI table

35

Page 37: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Summary

Several issues for “mini dump” has been being improved

Memory consumption issue

Dump filtering performance degradation issue

Now mini dump can be collected with no problem

Ongoing work for “full dump”

Multiple CPUs are essential for “full dump” configuration

Under discussion in community

Copyright 2013 FUJITSU LIMITED 36

Page 38: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Future work

Improvements for Capture Kernel to show performance like the experimental benchmark

Kernel

• Multiple CPUs on capture kernel

• Support large pages on remap_pfn_range

Makedumpfile

• Support direct I/O for makedumpfile

• Make compression block size configurable

A lot of benchmark

Copyright 2013 FUJITSU LIMITED 37

Page 39: Kdump: towards fast and stable crash dumping on terabyte ... · Collecting crash dump takes several hours without special handling Dump filtering •reduce size of crash dump –

Copyright 2013 FUJITSU LIMITED 38