
Linux Initialization Process (1)


Initialization (1)
Taku Shimosawa

For the new Linux kernel book


Agenda

• Initialization Phase of the Linux Kernel
  • Turning on the paging feature
  • Calling *init functions

• And miscellaneous things related to initialization


1. vmlinux
This is the Linux kernel


vmlinux
• Main kernel binary

• Runs with the final CPU state
  • Protected Mode in x86_32 (i386)
  • Long Mode in x86_64
  • And so on…

• Runs in the virtual memory space
  • Above PAGE_OFFSET (default: 0xc0000000) (32-bit)
  • Above __START_KERNEL_map (default: 0xff…f80000000) (64-bit)
  • i.e. all the absolute addresses in the binary are virtual ones

• Entry points

Architecture  Name        Location                         Name (secondary)
x86_32        startup_32  arch/x86/kernel/head_32.S        startup_32_smp
x86_64        startup_64  arch/x86/kernel/head_64.S        secondary_startup_64
ARM           stext       arch/arm/kernel/head[_nommu].S   secondary_startup
ARM64         stext       arch/arm64/kernel/head.S         secondary_holding_pen, secondary_entry
PPC           _stext      arch/powerpc/kernel/head_32.S*   (__secondary_start)


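Virtual memory mapping

[Figure: how physical memory is mapped into the i386 and x86_64 virtual address spaces]
• i386 virtual (0x00000000 to 0xFFFFFFFF): LOWMEM, up to ~896 MB, is mapped at PAGE_OFFSET (0xC0000000)
• x86_64 virtual (0x0000000000000000 to 0xFFFFFFFFFFFFFFFF): physical memory is mapped at PAGE_OFFSET (0xFFFF880000000000); text/data is mapped at __START_KERNEL_map (0xFFFFFFFF80000000), in the last 2GB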


Why different mapping in 64-bit?

• The kernel code, data, and BSS reside in the last 2 GB of the virtual address space

=> Addressable with sign-extended 32-bit values!

• -mcmodel option in GCC
  • Specifies the assumptions for the size of code/data sections

-mcmodel option (x86)

Model    text                 data
small    within 2GB           within 2GB
kernel   within -2GB          within -2GB
medium   within 2GB           can be > 2GB
large    anywhere in 64-bit   anywhere in 64-bit
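The "within -2GB" rule for the kernel model can be checked with a small sketch. It assumes that "addressable by 32-bit" means the address equals a 32-bit value sign-extended to 64 bits; the helper name toy_fits_sign_extended_32 is hypothetical and not part of any kernel or compiler API.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if addr can be written as a 32-bit value
 * that is sign-extended to 64 bits. That covers the lowest 2 GB
 * (small model) and the highest 2 GB (kernel model).
 */
static bool toy_fits_sign_extended_32(uint64_t addr)
{
    return addr <= UINT64_C(0x000000007fffffff) ||
           addr >= UINT64_C(0xffffffff80000000);
}
```

Under this assumption, __START_KERNEL_map (0xFFFFFFFF80000000) passes the check, while the direct-mapping PAGE_OFFSET (0xFFFF880000000000) does not, which is consistent with only kernel text/data living in the last 2 GB.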


Column: -mcmodel in gcc

    int g_data = 4;

    int main(void)
    {
        g_data += 7;
        ...
    }

    8b 05 c6 0b 20 00       mov    0x200bc6(%rip),%eax   # 601040 <g_data>
    ...
    bf 01 00 00 00          mov    $0x1,%edi
    8d 50 07                lea    0x7(%rax),%edx

large:

    48 b8 40 10 60 00 00    movabs $0x601040,%rax
    00 00 00
    bf 01 00 00 00          mov    $0x1,%edi
    8b 30                   mov    (%rax),%esi
    ...
    8d 56 07                lea    0x7(%rsi),%edx

    #define SZ (1 << 30)

    int buf[SZ] = {1};

    int main(void)
    {
        buf[0] += 3;
    }

small, kernel:

    $ gcc -O3 -o ba -mcmodel=small bigarray.c
    /usr/lib/gcc/x86_64-linux-gnu/4.8/crtbegin.o: In function `deregister_tm_clones':
    crtstuff.c:(.text+0x1): relocation truncated to fit: R_X86_64_32 against symbol `__TMC_END__' defined in .data section in ba

medium, large:

    48 b8 60 10 a0 00 00    movabs $0xa01060,%rax
    00 00 00
    8b 08                   mov    (%rax),%ecx
    8d 51 03                lea    0x3(%rcx),%edx

* The offset of RIP-relative addressing is 32-bit


Column: -mcmodel in gcc (2)

• Code?

    void nop(void)
    {
        asm volatile(".fill (2 << 30), 1, 0x90");
    }

small, medium, kernel:

    $ gcc -O3 -o ba -mcmodel=small supernop.c
    /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
    (.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini' defined in .text section in /usr/lib/x86_64-linux-gnu/libc_nonshared.a(elf-init.oS)

large:

    $ gcc -O3 -o ba -mcmodel=large supernop.c
    /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
    (.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini' defined in .text section in /usr/lib/x86_64-linux-gnu/libc_nonshared.a(elf-init.oS)


Initialization Overview

• arch/*/boot/
  Booting Code (Preparing CPU states, Gathering HW information, Decompressing vmlinux etc.)

vmlinux:

• arch/*/kernel/head*.S, head*.c
  Low-level Initialization (Switching to virtual memory world, Getting prepared for C programs)

• init/main.c (start_kernel)
  Initialization (Initializing all the kernel features including architecture-dependent parts)
  Calls into arch/*/kernel, arch/*/mm, …

• init/main.c (rest_init)
  Creating the “init” process, and letting it do the rest of the initialization (Setting up multiprocessing, scheduling)

  • kernel/sched/idle.c (cpu_idle_loop)
    “Swapper” (PID=0) now sleeps

  • init/main.c (kernel_init)
    Performing final initialization and “exec”ing the “init” user process: “init” (PID=1)


2. Towards Virtual Memory


Enabling paging

• The early part is executed with paging off.
  • Physical address space

• vmlinux is assumed to be executed with paging on.
  • The addresses in the binary are not physical addresses.

• The first big job in vmlinux is enabling paging
  • Creating a (transitional) page table
  • Setting the CPU to use the page table, and to enable paging
  • Jumping to the entry point in C (compiled in the virtual address space)


Identity Map

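• At first, the goal page table cannot be used
  • Since changing PC and enabling paging are (at least, in x86) separate instructions.

[Figure: the PC still holds a physical address when paging is enabled; with only the goal (virtual) mapping in place, the next instruction fetch causes a page fault]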


Identity Map

• Therefore, an identity map is created in addition to the (goal) map.

[Figure: the PC moves from the physical (identity-mapped) address to the virtual address with a jump]

(1) Create an initial page table
(2) Enable paging, and jump to a virtual address.
(3) Zap the low mapping
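The three steps can be modelled in user space with a toy "page table" that holds just two ranges. This is only an illustrative sketch: every toy_* name, the 0xC0000000 base, and the 4 MB mapping size are assumptions made for the example, not kernel definitions, and a real page table is a multi-level hardware structure.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_KERNEL_BASE 0xC0000000u  /* assumed PAGE_OFFSET-like base */
#define TOY_MAP_SIZE    0x00400000u  /* assumed: first 4 MB are mapped */

struct toy_range { uint32_t virt, phys, size; bool present; };
struct toy_pgtable { struct toy_range identity, kernel; };

/* Step (1): build both the identity map and the goal (high) map. */
static void toy_pgtable_init(struct toy_pgtable *pt)
{
    pt->identity = (struct toy_range){ 0, 0, TOY_MAP_SIZE, true };
    pt->kernel = (struct toy_range){ TOY_KERNEL_BASE, 0, TOY_MAP_SIZE, true };
}

/* Step (3): zap the low mapping once the PC runs at a high address. */
static void toy_zap_identity(struct toy_pgtable *pt)
{
    pt->identity.present = false;
}

static bool toy_range_hit(const struct toy_range *r, uint32_t virt,
                          uint32_t *phys)
{
    if (!r->present || virt < r->virt || virt - r->virt >= r->size)
        return false;
    *phys = r->phys + (virt - r->virt);
    return true;
}

/* Returns false where real hardware would raise a page fault. */
static bool toy_translate(const struct toy_pgtable *pt, uint32_t virt,
                          uint32_t *phys)
{
    return toy_range_hit(&pt->identity, virt, phys) ||
           toy_range_hit(&pt->kernel, virt, phys);
}
```

Right after step (2), a PC value such as 0x00100000 still translates thanks to the identity range; after toy_zap_identity() only the high alias 0xC0100000 remains valid.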


Addresses in the transitional phase

• x86_64
  • The decompressing routine enables paging and creates an identity page table (only for the first 4GB)
    • Paging is required for CPUs to switch to 64-bit mode
    • Located in 6 pages (pgtable) in the decompressing routine
  • Symbols in vmlinux are accessed with RIP-relative addressing
    • No trick is necessary for using the symbols

    leaq    _text(%rip), %rbp
    subq    $_text - __START_KERNEL_map, %rbp
    ...
    leaq    early_level4_pgt(%rip), %rbx
    ...
    movq    $(early_level4_pgt - __START_KERNEL_map), %rax
    addq    phys_base(%rip), %rax
    movq    %rax, %cr3
    movq    $1f, %rax
    jmp     *%rax
    1:

(arch/x86/kernel/head_64.S)


Addresses in the transitional phase

• i386
  • Symbols in vmlinux are accessed with absolute addresses
  • Before paging is enabled, PAGE_OFFSET is always subtracted from addresses

    #define pa(X) ((X) - __PAGE_OFFSET)

    movl $pa(__bss_start),%edi
    movl $pa(__bss_stop),%ecx
    subl %edi,%ecx
    shrl $2,%ecx
    rep ; stosl
    ...
    movl $pa(initial_page_table), %eax
    movl %eax,%cr3          /* set the page table pointer.. */
    movl $CR0_STATE,%eax
    movl %eax,%cr0          /* ..and set paging (PG) bit */
    ljmp $__BOOT_CS,$1f     /* Clear prefetch and normalize %eip */
    1:
    ...
    lgdt early_gdt_descr
    lidt idt_descr

(arch/x86/kernel/head_32.S)
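The pa() arithmetic is plain subtraction, and its inverse is what __va() does later. A minimal sketch in C, assuming the default i386 PAGE_OFFSET of 0xC0000000 from the earlier slide; toy_pa and toy_va are hypothetical stand-ins, not the kernel macros.

```c
#include <stdint.h>

#define TOY_PAGE_OFFSET 0xC0000000u  /* assumed: default i386 PAGE_OFFSET */

/* Link-time (virtual) address -> physical address, usable before paging. */
static uint32_t toy_pa(uint32_t virt)
{
    return virt - TOY_PAGE_OFFSET;
}

/* Physical address -> virtual address in the kernel's linear mapping. */
static uint32_t toy_va(uint32_t phys)
{
    return phys + TOY_PAGE_OFFSET;
}
```

For example, a symbol linked at 0xC1000000 is found at physical 0x01000000 while paging is still off.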


3. Initialization
At last, we have come here!


Initialization (start_kernel)

• A lot of *_init functions!
  • Furthermore, some init functions call other init functions.
  • At least 80 functions are called in this function.

• These slides pick up some topics from the initialization functions


2.9. Before Initialization
A few more tricks


Special directives

• What are these?

• “I’m curious!”.


asmlinkage __visible void __init start_kernel(void) {…}


asmlinkage

• asmlinkage
  • Ensures the symbol is not mangled
  • (in x86_32) Ensures all the parameters are passed on the stack

    #ifdef CONFIG_X86_32
    #define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))

(arch/x86/include/asm/linkage.h)

    #ifdef __cplusplus
    #define CPP_ASMLINKAGE extern "C"
    #else
    #define CPP_ASMLINKAGE
    #endif

    #ifndef asmlinkage
    #define asmlinkage CPP_ASMLINKAGE
    #endif

(include/linux/linkage.h)


__visible

• (Effective in gcc >= 4.6)

    #if GCC_VERSION >= 40600
    /*
     * Tell the optimizer that something else uses this function or variable.
     */
    #define __visible __attribute__((externally_visible))
    #endif

(include/linux/compiler-gcc4.h)

    commit 9a858dc7cebce01a7bb616bebb85087fa2b40871
    author Andi Kleen <[email protected]> Mon Sep 17 21:09:15 2012
    committer Linus Torvalds <[email protected]> Mon Sep 17 22:00:38 2012

    compiler.h: add __visible

    gcc 4.6+ has support for a externally_visible attribute that prevents the
    optimizer from optimizing unused symbols away. Add a __visible macro to
    use it with that compiler version or later.

    This is used (at least) by the "Link Time Optimization" patchset.


__init (1)

• To mark code (text) and data as only necessary during initialization

    #define __init          __section(.init.text) __cold notrace
    #define __initdata      __section(.init.data)
    #define __initconst     __constsection(.init.rodata)
    #define __exitdata      __section(.exit.data)
    #define __exit_call     __used __section(.exitcall.exit)

(include/linux/init.h)

    #ifndef __cold
    #define __cold __attribute__((__cold__))
    #endif

(include/linux/compiler-gcc4.h)

    #ifndef __section
    # define __section(S) __attribute__ ((__section__(#S)))
    #endif
    ...
    #define notrace __attribute__((no_instrument_function))

(include/linux/compiler.h)


__init (2)
• The init* sections are gathered into a contiguous memory area

    . = ALIGN(PAGE_SIZE);
    .init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
        __init_begin = .; /* paired with __init_end */
    }
    ...
    INIT_TEXT_SECTION(PAGE_SIZE)
    #ifdef CONFIG_X86_64
    :init
    #endif

    INIT_DATA_SECTION(16)
    ....
    . = ALIGN(PAGE_SIZE);
    ...
    .init.end : AT(ADDR(.init.end) - LOAD_OFFSET) {
        __init_end = .;
    }

(arch/x86/kernel/vmlinux.lds.S)

[Figure: init.text and init.data laid out contiguously between __init_begin and __init_end]


__init (3)

• And, they are discarded (freed) after initialization
  • free_initmem is called from kernel_init

    void free_initmem(void)
    {
        free_init_pages("unused kernel",
                        (unsigned long)(&__init_begin),
                        (unsigned long)(&__init_end));
    }

(arch/x86/mm/init.c)

    void free_initmem(void)
    {
        ...
        poison_init_mem(__init_begin, __init_end - __init_begin);
        if (!machine_is_integrator() && !machine_is_cintegrator())
            free_initmem_default(-1);
    }

(arch/arm/mm/init.c)


head32.c, head64.c

• Before calling start_kernel, i386_start_kernel or x86_64_start_kernel is called in x86
  • Located in arch/x86/kernel/head{32,64}.c
  • No underscore between head and 32!

• x86 (32-bit)
  • Reserve BIOS memory (in conventional memory)

• x86 (64-bit)
  • Erase the identity map
  • Clear BSS, copy boot information from the low memory
  • And reserve BIOS memory


Reserve? But how?
• This is a very early stage. No complicated memory management is working yet.
  • memblock (Logical memory blocks) is working!
• memblock simply manages memory blocks
  • And in some architectures, the information is handed over to another mechanism, and memblock is discarded after initialization

    #define BIOS_LOWMEM_KILOBYTES 0x413
    lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
    lowmem <<= 10;
    ...
    memblock_reserve(lowmem, 0x100000 - lowmem);

(arch/x86/kernel/head.c)

    #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
    #define __init_memblock __meminit
    #define __initdata_memblock __meminitdata
    #else
    ...
    #endif

(include/linux/memblock.h)

• CONFIG_ARCH_DISCARD_MEMBLOCK is set in S+Core, IA64, S390, SH, MIPS and x86
• Without memory hotplug, __meminit is __init.
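The range passed to memblock_reserve in the BIOS snippet can be worked out by hand. The sketch below only reproduces the two visible arithmetic steps (shift by 10, subtract from 1 MB); the sanity checks hidden behind the "..." in the kernel source are not modelled, and the struct and function names are hypothetical.

```c
#include <stdint.h>

struct toy_resv { uint32_t base, size; };

/*
 * lowmem_kb is the 16-bit value the BIOS stores at physical 0x413:
 * the amount of conventional memory in kilobytes.
 */
static struct toy_resv toy_bios_reserve(uint16_t lowmem_kb)
{
    uint32_t lowmem = (uint32_t)lowmem_kb << 10;      /* KB -> bytes */
    struct toy_resv r = { lowmem, 0x100000u - lowmem };
    return r;
}
```

With an illustrative value of 639 KB, the reserved range starts at 0x9FC00 and is 0x60400 bytes long, i.e. everything from the end of conventional memory up to the 1 MB boundary.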


memblock

• Data Structure (include/linux/memblock.h)

  memblock (struct memblock; global variable “memblock”)
    memory   (memblock_type) -> array of memblock_region
    reserved (memblock_type) -> array of memblock_region

  memblock_region
    • base, size, flags[, nid]

• Initially the arrays are allocated statically

    static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
    static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;

* INIT_MEMBLOCK_REGIONS = 128


Reserving in memblock

• Reserving adds the region to the region array in the “reserved” type

• The function for adding an available region is memblock_add

    static int __init_memblock memblock_reserve_region(phys_addr_t base,
                                                       phys_addr_t size,
                                                       int nid,
                                                       unsigned long flags)
    {
        struct memblock_type *_rgn = &memblock.reserved;
        ...
        return memblock_add_region(_rgn, base, size, nid, flags);
    }

    int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
    {
        return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
    }
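The two-array design can be imitated in a few lines of user-space C. This is a deliberately simplified sketch: all toy_* names are hypothetical, and it skips what the real memblock_add_region does beyond appending (keeping regions sorted, merging neighbours, tracking flags and node IDs).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_REGIONS 128  /* mirrors INIT_MEMBLOCK_REGIONS */

struct toy_region { uint64_t base, size; };
struct toy_type { size_t cnt; struct toy_region regions[TOY_MAX_REGIONS]; };
struct toy_memblock { struct toy_type memory, reserved; };

/* Append a region to one of the two arrays; -1 if the array is full. */
static int toy_add_region(struct toy_type *t, uint64_t base, uint64_t size)
{
    if (t->cnt == TOY_MAX_REGIONS)
        return -1;
    t->regions[t->cnt].base = base;
    t->regions[t->cnt].size = size;
    t->cnt++;
    return 0;
}

static int toy_memblock_add(struct toy_memblock *mb, uint64_t base,
                            uint64_t size)
{
    return toy_add_region(&mb->memory, base, size);
}

static int toy_memblock_reserve(struct toy_memblock *mb, uint64_t base,
                                uint64_t size)
{
    return toy_add_region(&mb->reserved, base, size);
}

/* True if addr falls inside any reserved region. */
static bool toy_is_reserved(const struct toy_memblock *mb, uint64_t addr)
{
    for (size_t i = 0; i < mb->reserved.cnt; i++) {
        const struct toy_region *r = &mb->reserved.regions[i];
        if (addr >= r->base && addr - r->base < r->size)
            return true;
    }
    return false;
}
```

The point of the sketch is that adding and reserving are the same append operation applied to different arrays, which matches how memblock_reserve_region simply passes &memblock.reserved to memblock_add_region.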


When is the available memory added?
• x86
  • memblock_x86_fill
    • called by setup_arch (8/80)
• ARM
  • arm_memblock_init
    • Also called by setup_arch (8/80)

    void __init memblock_x86_fill(void)
    {
        ...
        memblock_allow_resize();

        for (i = 0; i < e820.nr_map; i++) {
            ...
            memblock_add(ei->addr, ei->size);
        }
        memblock_trim_memory(PAGE_SIZE);
        ...
    }

BTW, what’s this? (memblock_allow_resize)


Resizing, or reallocation.

• Memblock uses slab for resizing if available
  • # of e820 entries may be more than 128
  • However, slab becomes available at kmem_cache_init called by mm_init (25/80), so not at this time.
• Memblock tries to allocate by itself by finding an area that is in memory && !reserved.

    static int __init_memblock memblock_double_array(struct memblock_type *type,
                                                     phys_addr_t new_area_start,
                                                     phys_addr_t new_area_size)
    {
        …
        addr = memblock_find_in_range(new_area_start + new_area_size,
                                      memblock.current_limit,
                                      new_alloc_size, PAGE_SIZE);


memblock: Debug options

• “memblock=debug”

    static int __init early_memblock(char *p)
    {
        if (p && strstr(p, "debug"))
            memblock_debug = 1;
        return 0;
    }
    early_param("memblock", early_memblock);

    static int __init_memblock memblock_reserve_region(...)
    {
        ...
        memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
                     (unsigned long long)base,
                     (unsigned long long)base + size - 1,
                     flags, (void *)_RET_IP_);
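How a "memblock=debug" token on the command line reaches a handler like early_memblock can be sketched in user space. The dispatcher below is an assumption made for illustration: the kernel's real early_param machinery registers handlers in a table and is not reproduced here, and all toy_* names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

static int toy_memblock_debug;

/* Same test as the handler on the slide: any value containing "debug". */
static int toy_early_memblock(const char *p)
{
    if (p && strstr(p, "debug"))
        toy_memblock_debug = 1;
    return 0;
}

/* Walk space-separated tokens and hand "memblock=<value>" to the handler. */
static void toy_parse_cmdline(const char *cmdline)
{
    static const char key[] = "memblock=";
    const size_t klen = sizeof(key) - 1;
    const char *p = cmdline;

    while (*p) {
        while (*p == ' ')
            p++;
        size_t len = strcspn(p, " ");
        if (len >= klen && strncmp(p, key, klen) == 0) {
            char val[64];
            size_t vlen = len - klen;
            if (vlen >= sizeof(val))
                vlen = sizeof(val) - 1;
            memcpy(val, p + klen, vlen);
            val[vlen] = '\0';
            toy_early_memblock(val);
        }
        p += len;
    }
}
```

Calling toy_parse_cmdline("root=/dev/sda1 memblock=debug quiet") sets toy_memblock_debug, while a command line without the parameter leaves it at 0.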


3. Initialization
Okay, okay.


start_kernel

• What’s the first initialization function called?
  • smp_setup_processor_id() ((at least 2.6.18) ~ 3.2)
  • lockdep_init() (3.3 ~)

    commit 73839c5b2eacc15cb0aa79c69b285fc659fa8851
    Author: Ming Lei <[email protected]>
    Date:   Thu Nov 17 13:34:31 2011 +0800

        init/main.c: Execute lockdep_init() as early as possible

        This patch fixes a lockdep warning on ARM platforms:

          [    0.000000] WARNING: lockdep init error! Arch code didn't call lockdep_init() early enough?
          [    0.000000] Call stack leading to lockdep invocation was:
          [    0.000000]  [<c00164bc>] save_stack_trace_tsk+0x0/0x90
          [    0.000000]  [<ffffffff>] 0xffffffff

        The warning is caused by printk inside smp_setup_processor_id().


init (1/80) : lockdep_init
• Initializes lockdep (lock validator)
  • “Runtime locking correctness validator”
  • Detects
    • Lock inversion
    • Circular lock dependencies
• When enabled, lockdep is called when any spinlock or mutex is acquired.
  • Thus, the initialization for lockdep must be first.
• Initialization is simple (just initializing list_head’s of hashes)

    void lockdep_init(void)
    {
        ...
        for (i = 0; i < CLASSHASH_SIZE; i++)
            INIT_LIST_HEAD(classhash_table + i);

        for (i = 0; i < CHAINHASH_SIZE; i++)
            INIT_LIST_HEAD(chainhash_table + i);
        ...
    }

(kernel/locking/lockdep.c)

Config: CONFIG_LOCKDEP, selected by PROVE_LOCKING or DEBUG_LOCK_ALLOC or LOCK_STAT


init (2/80) : smp_setup_processor_id

• Only effective in some architectures
  • ARM, s390, SPARC

    u32 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = MPIDR_INVALID };

    void __init smp_setup_processor_id(void)
    {
        int i;
        u32 mpidr = is_smp() ? read_cpuid_mpidr() & MPIDR_HWID_BITMASK : 0;
        u32 cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);

        cpu_logical_map(0) = cpu;
        for (i = 1; i < nr_cpu_ids; ++i)
            cpu_logical_map(i) = i == cpu ? 0 : i;
        set_my_cpu_offset(0);

        pr_info("Booting Linux on physical CPU 0x%x\n", mpidr);
    }

(arch/arm/kernel/setup.c)

• mpidr: Hardware CPU (core) ID
• The loop exchanges the logical ID for the boot CPU and the logical ID for the CPU 0.

[Figure: example contents of cpu_logical_map]
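The ID exchange is easy to trace in isolation. The sketch below copies only the assignment loop from smp_setup_processor_id into a standalone function; the four-CPU limit and the toy_* names are assumptions for the example.

```c
#include <stdint.h>

#define TOY_NR_CPUS 4  /* assumed CPU count for the example */

/*
 * Fill map[] so that logical CPU 0 is the hardware CPU we booted on,
 * and the logical slot that would have held that hardware ID gets 0.
 */
static void toy_setup_logical_map(uint32_t boot_hwid,
                                  uint32_t map[TOY_NR_CPUS])
{
    map[0] = boot_hwid;
    for (uint32_t i = 1; i < TOY_NR_CPUS; ++i)
        map[i] = (i == boot_hwid) ? 0 : i;
}
```

Booting on hardware CPU 2 yields the map {2, 1, 0, 3}; booting on hardware CPU 0 leaves the identity map {0, 1, 2, 3}.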


init (3/80) : debug_objects_early_init

• Initializes debugobjects
  • Lifetime debugging facility for objects
  • Seems to be used by timer, hrtimer, workqueue, per_cpu_counter and rcu
• Again, this function initializes locks and list heads

Config: CONFIG_DEBUG_OBJECTS

    void __init debug_objects_early_init(void)
    {
        int i;

        for (i = 0; i < ODEBUG_HASH_SIZE; i++)
            raw_spin_lock_init(&obj_hash[i].lock);

        for (i = 0; i < ODEBUG_POOL_SIZE; i++)
            hlist_add_head(&obj_static_pool[i].node, &obj_pool);
    }

(lib/debugobjects.c)


init (4/80): boot_init_stack_canary

• Set up the stack protector
  • include/asm/stackprotector.h
  • Decide the canary value based on a random value and the TSC

    static __always_inline void boot_init_stack_canary(void)
    {
        u64 canary;
        u64 tsc;

    #ifdef CONFIG_X86_64
        BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
    #endif
        get_random_bytes(&canary, sizeof(canary));
        tsc = __native_read_tsc();
        canary += tsc + (tsc << 32UL);

        current->stack_canary = canary;
    #ifdef CONFIG_X86_64
        this_cpu_write(irq_stack_union.stack_canary, canary);
    #else
        this_cpu_write(stack_canary.canary, canary);
    #endif
    }


init (5/80): cgroup_init_early

• Initializes cgroups
  • For subsystems that have early_init set, initialize the subsystem.
    • cpu, cpuacct, cpuset
  • The rest of the subsystems are initialized in cgroup_init (71/80)
• Initializes the structure, and the names for the subsystems


init (6/80): boot_cpu_init
• Initializes various cpumasks for the boot CPU
  • online : available to scheduler
  • active : available to migration
  • present : cpu is populated
  • possible : cpu is populatable
• set_cpu_online adds the cpu to active
• set_cpu_present does not add the cpu to possible

    static void __init boot_cpu_init(void)
    {
        int cpu = smp_processor_id();
        /* Mark the boot cpu "present", "online" etc for SMP and UP case */
        set_cpu_online(cpu, true);
        set_cpu_active(cpu, true);
        set_cpu_present(cpu, true);
        set_cpu_possible(cpu, true);
    }

(init/main.c)

• !HOTPLUG_CPU => online and active are the same
• !HOTPLUG_CPU => present and possible are the same


cpumask

• A bit map

    typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

(include/linux/cpumask.h)

    #define DECLARE_BITMAP(name,bits) \
        unsigned long name[BITS_TO_LONGS(bits)]

(include/linux/types.h)

    #define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))

(include/linux/bitops.h)

[Figure: bits is an array of long (4 / 8 bytes each) holding NR_CPUS bits]
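The bitmap layout can be reproduced in portable C to see how a CPU number maps to a word and a bit. This is a non-atomic sketch with hypothetical toy_* names and an arbitrary CPU count of 130; the kernel's cpumask helpers use atomic bit operations.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 130  /* arbitrary count that does not fill whole words */
#define TOY_BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))
/* Round up: how many longs are needed to hold nr bits. */
#define TOY_BITS_TO_LONGS(nr) \
    (((nr) + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

struct toy_cpumask {
    unsigned long bits[TOY_BITS_TO_LONGS(TOY_NR_CPUS)];
};

/* Word index = cpu / bits-per-long, bit index = cpu % bits-per-long. */
static void toy_cpumask_set(struct toy_cpumask *m, size_t cpu)
{
    m->bits[cpu / TOY_BITS_PER_LONG] |= 1UL << (cpu % TOY_BITS_PER_LONG);
}

static bool toy_cpumask_test(const struct toy_cpumask *m, size_t cpu)
{
    return (m->bits[cpu / TOY_BITS_PER_LONG] >>
            (cpu % TOY_BITS_PER_LONG)) & 1UL;
}
```

The rounding in TOY_BITS_TO_LONGS is what makes a mask for 130 CPUs occupy three 64-bit longs (or five 32-bit longs) rather than being truncated.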


Set bit! (x86)

• The register bit offset operand for bts is
  • -2^31 ~ 2^31-1 or -2^63 ~ 2^63-1

    #define IS_IMMEDIATE(nr) (__builtin_constant_p(nr))
    ...
    static __always_inline void
    set_bit(long nr, volatile unsigned long *addr)
    {
        if (IS_IMMEDIATE(nr)) {
            asm volatile(LOCK_PREFIX "orb %1,%0"
                : CONST_MASK_ADDR(nr, addr)
                : "iq" ((u8)CONST_MASK(nr))
                : "memory");
        } else {
            asm volatile(LOCK_PREFIX "bts %1,%0"
                : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
        }
    }

(arch/x86/include/asm/bitops.h)


Set bit! (ARM)

    #if __LINUX_ARM_ARCH__ >= 6
        .macro  bitop, name, instr
    ENTRY(  \name       )
    UNWIND( .fnstart    )
        ands    ip, r1, #3
        strneb  r1, [ip]            @ assert word-aligned
        mov     r2, #1
        and     r3, r0, #31         @ Get bit offset
        mov     r0, r0, lsr #5
        add     r1, r1, r0, lsl #2  @ Get word offset
        ...
        mov     r3, r2, lsl r3
    1:  ldrex   r2, [r1]
        \instr  r2, r2, r3
        strex   r0, r2, [r1]
        cmp     r0, #0
        bne     1b
        bx      lr
    UNWIND( .fnend      )
    ENDPROC(\name       )
        .endm

    bitop   _set_bit, orr
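The ldrex/strex retry loop has a close analogue in C11 atomics: load the word, compute the new value, and retry if another CPU changed the word in between. The sketch below uses a compare-and-exchange loop as a stand-in for the load-linked/store-conditional pair; it is an illustration of the pattern, not the kernel's implementation, and toy_set_bit is a hypothetical name.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

static void toy_set_bit(size_t nr, _Atomic unsigned long *addr)
{
    /* "Get word offset" and "Get bit offset", as in the assembly. */
    _Atomic unsigned long *word = addr + nr / TOY_BITS_PER_LONG;
    unsigned long mask = 1UL << (nr % TOY_BITS_PER_LONG);

    unsigned long old = atomic_load(word);
    /*
     * On failure, atomic_compare_exchange_weak reloads 'old' with the
     * current value, so the loop retries like "bne 1b" after strex.
     */
    while (!atomic_compare_exchange_weak(word, &old, old | mask))
        ;
}
```

In practice atomic_fetch_or would do the same job in one call; the explicit loop is kept here only because it mirrors the structure of the ARM macro.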


smp_processor_id
• Returns the core ID (in the kernel)
• In ARM (and old days in x86)
  • Located in the thread_info of “current”
  • thread_info sits at the lowest address of the current kernel stack area
• In x86
  • Located in the per-cpu area.

    #define raw_smp_processor_id() (this_cpu_read(cpu_number))

(arch/x86/include/asm/smp.h)

    #define raw_smp_processor_id() (current_thread_info()->cpu)

(arch/arm/include/asm/smp.h)

    static inline struct thread_info *current_thread_info(void)
    {
        register unsigned long sp asm ("sp");
        return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
    }

(arch/arm/include/asm/thread_info.h)
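The masking trick in current_thread_info works because each kernel stack is aligned to its own size. A small sketch of the arithmetic, assuming an 8 KB THREAD_SIZE (the value is an assumption for the example, as is the toy_* name):

```c
#include <stdint.h>

#define TOY_THREAD_SIZE 8192u  /* assumed stack size; must be a power of two */

/* Clear the low bits of sp to reach the base of the stack area. */
static uintptr_t toy_thread_info_base(uintptr_t sp)
{
    return sp & ~((uintptr_t)TOY_THREAD_SIZE - 1);
}
```

Any stack pointer inside the same 8 KB block, for example 0xC1235F40, resolves to the same base 0xC1234000, so no per-CPU register or global lookup is needed to find the current thread's data.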


Next

• Topics and the rest of initialization
  • Setup parameters (early_param() etc.)
  • Initcalls
  • Multiprocessor support
    • Per-cpus
    • SMP boot (secondary boot)
      • SMP alternatives
    • And other alternatives
  • And Others?
    • Modules?