Initialization (2)
Taku Shimosawa
Pour le livre nouveau du Linux noyau (For the new Linux kernel book)

Linux Initialization Process (2)



Bootstrapping Code in Linux Kernel

Agenda
- Initialization function list
  - The list of the functions called from the kernel startup function (start_kernel)
  - The lists of the functions called from some of the functions that start_kernel calls
    - setup_arch
    - rest_init, and the following functions
- Initialization topics
  - Multiprocessor (SMP) Initialization

3. Initialization
At last, we have come here!

Initialization Overview
1. arch/*/boot/: Booting code (preparing CPU states, gathering HW information, decompressing vmlinux, etc.)
2. arch/*/kernel/head*.S, head*.c: Low-level initialization (switching to the virtual memory world, getting prepared for C programs)
3. init/main.c (start_kernel): Initialization (initializing all the kernel features, including architecture-dependent parts)
4. init/main.c (rest_init): Creating the init process, and letting it do the rest of the initialization (setting up multiprocessing, scheduling)
5. kernel/sched/idle.c (cpu_idle_loop): Swapper (PID=0) now sleeps
6. init/main.c (kernel_init): Performing final initialization and exec'ing the init user process, init (PID=1)
(start_kernel calls into arch/*/kernel, arch/*/mm, ...; the stages from head*.S onward are in vmlinux.)

start_kernel (1)
| # | Function | Category | Description |
|---|---|---|---|
| 1 | lockdep_init | Debug | Lock validator |
| 2 | smp_setup_processor_id* | SMP | Initialize processor ID (some architectures) |
| 3 | debug_objects_early_init | Debug | Lifetime debugging facility for objects |
| 4 | boot_init_stack_canary* | Debug | Decide the canary value for the stack protector |
| 5 | cgroup_init_early | cgroup | Early init for some cgroup subsystems |
| 6 | boot_cpu_init | SMP | Set the boot CPU for various cpumasks |
| 7 | page_address_init | MM | Initialize hash for kmap (highmem) |
| 8 | setup_arch* | | |
| 9 | mm_init_owner | MM | Set init_mm's owner to init_task |
| 10 | mm_init_cpumask | MM | Set the cpu mask pointer to the mm's cpumask (only if CPUMASK_OFFSTACK) |
| 11 | setup_command_line | Init | Copy the command line parameter to a newly allocated buffer (allocated by memblock) |
| 12 | setup_nr_cpu_ids | SMP | Set nr_cpu_ids according to the last bit in the possible mask |
Functions with *: mostly architecture-dependent code

start_kernel (2)
| # | Function | Category | Description |
|---|---|---|---|
| 13 | setup_per_cpu_areas* | SMP | Allocate and initialize percpu areas |
| 14 | smp_prepare_boot_cpu* | SMP | Prepare for SMP boot |
| 15 | build_all_zonelists | MM | Initialize the zonelists |
| 16 | page_alloc_init | MM | Add a handler for CPU hotplug (to drain pages) |
| 17 | parse_early_param | Init | Parse early options |
| 18 | parse_args | Init | Parse the rest of the options |
| 19 | jump_label_init | Option | Jump label (self-modification) |
| 20 | setup_log_buf | Debug | Allocate and initialize the printk log buffer |
| 21 | pidhash_init | Sched | Initialize PID hash |
| 22 | vfs_caches_init_early | FS | Initialize various caches (kmem_cache) in VFS (dcache, inode, mnt, files, ...) |
| 23 | sort_main_extable | MM | Sort the exception table (used in page faults) |
| 24 | trap_init* | CPU | Initialize trap handlers |

start_kernel (3)
| # | Function | Category | Description |
|---|---|---|---|
| 25 | mm_init | MM | Initialize MM |
| 25A | page_cgroup_init_flatmem | MM | Allocate pages for page_cgroup |
| 25B | mem_init* | MM | Free pages for the buddy allocator |
| 25C | kmem_cache_init | MM | Initialize cache |
| 25D | percpu_init_late | MM | Replace per-cpu chunks with those allocated by slab |
| 25E | pgtable_init* | MM | Create cache for ptlock and pgtable (SH etc.) |
| 25F | vmalloc_init | MM | Initialize vmalloc |
| 26 | sched_init | Sched | Initialize scheduler |
| 27 | idr_init_cache | Util | Initialize IDR (ID to pointer translation) |
| 28 | rcu_init | SMP | Initialize RCU |
| 29 | tick_nohz_init | Sched | Initialize NOHZ (enable context tracking) |
| 30 | radix_tree_init | Util | Initialize radix tree (create cache, etc.) |
| 31 | early_irq_init* | CPU | Initialize irq_desc |

start_kernel (4)
| # | Function | Category | Description |
|---|---|---|---|
| 32 | init_IRQ* | CPU | Initialize various IRQs (in x86, set gates for APIC interrupts, etc.) |
| 33 | tick_init | Timer | Tick broadcast (to emulate local timer) |
| 34 | init_timers | Timer | Timer stats, notifier, and timer softirq |
| 35 | hrtimers_init | Timer | hrtimer notifier, and hrtimer softirq |
| 36 | softirq_init | Sched | Tasklet lists, and tasklet softirqs |
| 37 | timekeeping_init | Timer | Clocksource |
| 38 | time_init* | Timer | (Platform-dependent) timer initialization |
| 39 | sched_clock_postinit | Sched | Start the hrtimer |
| 40 | perf_event_init | Debug | Perf events |
| 41 | profile_init | Debug | (Simple) profiler |
| 42 | call_function_init | SMP | Initialize csd (call single data) queue |
| | local_irq_enable | CPU | At this point, interrupts are enabled |

start_kernel (5)
| # | Function | Category | Description |
|---|---|---|---|
| 43 | kmem_cache_init_late | MM | Post-initialization of cache (slab) |
| 44 | console_init | Console | Call console initcalls |
| 45 | lockdep_info | Debug | Print lockdep information |
| 46 | locking_selftest | Debug | Test spinlocks, rwlocks, mutexes, and rwsemaphores |
| 47 | page_cgroup_init | cgroup | Page cgroup |
| 48 | debug_objects_mem_init | Debug | Enable dynamic allocation for debugobjects (#3), and replace static ones with newly allocated ones |
| 49 | kmemleak_init | Debug | kmemleak (memory leak check facility) |
| 50 | setup_per_cpu_pageset | MM | Per-cpu pageset |
| 51 | numa_policy_init | MM | NUMA (VMA) policy |
| 52 | late_time_init* | Timer | Late initialization (in x86, HPET and TSC are initialized) |

start_kernel (6)
| # | Function | Category | Description |
|---|---|---|---|
| 53 | sched_clock_init | Sched | Set the time info for the scheduler |
| 54 | calibrate_delay | Timer | Calibrate for the delay functions |
| 55 | pidmap_init | Process | Init PID map for the initial PID namespace |
| 56 | anon_vma_init | MM | Create cache for anon_vma |
| 57 | acpi_early_init | ACPI | ACPI subsystems, load DSDT |
| 58 | thread_info_cache_init | Process | Allocate cache for thread_info if its size is less than PAGE_SIZE |
| 59 | cred_init | Security | Task credential |
| 60 | fork_init | Process | Allocate a cache for task_struct |
| 61 | proc_caches_init | MM | Allocate caches for mm_struct, etc. |
| 62 | buffer_init | FS | Allocate a cache for buffer_head |
| 63 | key_init | Security | Allocate a cache for key_jar |
| 64 | security_init | Security | Call security_initcalls |
| 65 | dbg_late_init | Debug | Late init for kgdb |

start_kernel (7)
| # | Function | Category | Description |
|---|---|---|---|
| 66 | vfs_caches_init | FS | Allocate SLAB caches and hashtables for various VFS caches (dcache, inode_cache, ...) |
| 67 | signals_init | Sched | Allocate a cache for sigqueue |
| 68 | page_writeback_init | MM | Initialize the ratio for the dirty pages |
| 69 | proc_root_init | Procfs | Create the root for procfs and some directories |
| 70 | cgroup_init | Cgroup | Initialize the rest of cgroups |
| 71 | cpuset_init | Sched | The top-level cpuset |
| 72 | taskstats_init_early | Sched | Task statistics exposed to the user level |
| 73 | delayacct_init | Sched | Task delay accounting |
| 74 | check_bugs* | CPU | Fix up for some architecture-dependent bugs (in x86_64, alternatives are initialized, and the first 2MB page is divided into 4K pages) |
| 75 | sfi_init_late | SFI | Map the area again by using ioremap |

start_kernel (8)
| # | Function | Category | Description |
|---|---|---|---|
| 76 | ftrace_init | Debug | ftrace |
| 77 | rest_init | | |

setup_arch (x86) (1)
| # | Function | Category | Description |
|---|---|---|---|
| 1 | memblock_reserve | MM | Reserve the text area |
| 2 | early_reserve_initrd | MM | Reserve the initrd area |
| 3 | clone_pgd_range, load_cr3 | MM | Switch to swapper_pg_dir (i386 only) |
| 4 | olpc_ofw_detect | Platform | OLPC OFW stuff |
| 5 | early_trap_init | CPU | Init debug and int3 gate |
| 6 | early_cpu_init | CPU | Detect the CPU's vendor (registered in cpu_dev_register: Intel, AMD, Cyrix) and call early_init and bsp_init |
| 7 | early_ioremap_init | MM | Init early ioremap |
| 8 | setup_olpc_ofw_pgd | Platform | OLPC OFW stuff |
| 9 | (Parsing boot parameters) | Setup | -- |
| 10 | x86_init.oem.arch_setup | Platform | OEM-dependent setup (Intel MID etc.) |
| 11 | setup_memory_map | MM | Copy and print e820 information |
| 12 | parse_setup_data | Setup | Parse setup_data in boot_params |

setup_arch (x86) (2)
| # | Function | Category | Description |
|---|---|---|---|
| 13 | copy_edd | Setup | Copy BIOS EDD information |
| 14 | (prepare init_mm) | MM | Set start_code, end_code, etc. for init_mm |
| 15 | (command line stuffs) | Setup | |
| 16 | x86_configure_nx | MM | Set ptemask according to whether NX is supported by the CPU |
| 17 | parse_early_param | Setup | (=#17 in start_kernel) |
| 18 | x86_report_nx | MM | Print NX information |
| 19 | memblock_x86_reserve_range_setup_data | MM | Reserve the setup_data area |
| 20 | acpi_mps_check | SMP | Check if ACPI is disabled and MPS code is not built-in |
| 21 | early_pci_dump_devices | Device | Dump PCI info before PCI is initialized |
| 22 | e820_reserve_setup_data | MM | Reserve the setup_data area in e820 |
| 23 | finish_e820_parsing | Setup | Sanitize and print e820 info |

setup_arch (x86) (4)
| # | Function | Category | Description |
|---|---|---|---|
| 24 | dmi_scan_machine | DMI | Check if DMI (Desktop Management Interface) is present or not |
| 25 | dmi_memdev_walk | DMI | Walk through the DMI table |
| 26 | dmi_set_dump_stack_arch_desc | DMI | Set architecture description (*) for dump_stack |
| 27 | init_hypervisor_platform | VM | Get the hypervisor information and init (e.g. get Hz using a special I/O port when running on VMware) |
| 28 | probe_roms | MM | Request resources for Video ROM, Extension ROMs, etc. |
| 29 | insert_resource | MM | Insert resources for the kernel's code, data, BSS |
| 30 | e820_add_kernel_range | MM | Add kernel code and data areas to e820 if they are not marked as E820_RAM |
| 31 | trim_bios_range | MM | Reserve BIOS areas in e820 |
(*) The "Hardware name" line in a dump such as:

    Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
    CPU: 3 PID: 2763 Comm: irqbalance Tainted: G W 3.14.13 #1
    Hardware name: Supermicro X9SRH-7F/7TF/X9SRH-7F77TF, BIOS 3.00 07/05/2013

setup_arch (x86) (5)
| # | Function | Category | Description |
|---|---|---|---|
| 32 | early_gart_iommu_check | Device | Check GART (Graphics Address Remapping Table) |
| 33 | (Assignment to max_pfn) | MM | Set max_pfn to the last page in e820 |
| 34 | mtrr_bp_init | CPU | MTRRs (Memory Type Range Registers) |
| 35 | check_x2apic | CPU | Enable X2APIC if available |
| 36 | find_smp_config | SMP | Find the SMP config for Intel MP Spec. |
| 37 | reserve_ibft_region | Device | Reserve the iSCSI Boot Firmware Table |
| 38 | early_alloc_pgt_buf | MM | Allocate page table buffer (to be used in the early stage) |
| 39 | reserve_brk | MM | Reserve brk area |
| 40 | cleanup_highmap | MM | Unmap out-of-range areas in the kernel map |
| 41 | memblock_set_current_limit | MM | Set the memblock's allocation limit to ISA_END_ADDRESS |
| 42 | memblock_x86_fill | MM | Fill the memblock info according to e820 |

setup_arch (x86) (6)
| # | Function | Category | Description |
|---|---|---|---|
| 43 | early_reserve_e820_mpc_new | SMP | Allocate for mptable |
| 44 | setup_bios_corruption_check | Setup | Fill 64KB of low memory with some pattern to detect if the BIOS corrupts the area |
| 45 | reserve_real_mode | CPU/SMP | Reserve some low memory for the trampoline |
| 46 | trim_platform_memory_ranges | Setup | Special tricks (reserve) for some platforms (some Sandy Bridge) |
| 47 | trim_low_memory_range | MM | Reserve the first 4KB page in memblock |
| 48 | init_mem_mapping | MM | Reconstruct memory mapping |
| 49 | early_trap_pf_init | CPU | Set page fault handler |
| 50 | setup_real_mode | CPU/SMP | Set up the trampoline code |
| 51 | memblock_set_current_limit | MM | Change the limit to the last page mapped |
| 52 | dma_contiguous_reserve | MM | Allocate contiguous area for DMA |

setup_arch (x86) (7)
| # | Function | Category | Description |
|---|---|---|---|
| 53 | setup_log_buf | Debug | Set up printk log buffer |
| 54 | reserve_initrd | MM | Reserve the initrd |
| 55 | acpi_initrd_override | ACPI | Find the ACPI override info in initrd |
| 56 | vsmp_init | Setup | vSMP (ScaleMP Inc.) |
| 57 | io_delay_init | Setup | Check DMI override for I/O delay strategy |
| 58 | acpi_boot_table_init | ACPI | ACPI BOOT table parsing |
| 59 | early_acpi_boot_init | ACPI | Parse MADT in ACPI |
| 60 | initmem_init | MM | Set up node information based on ACPI (if NUMA) |
| 61 | reserve_crashkernel | Debug | Reserve memory for crashkernel |
| 62 | memblock_find_dma_reserve | MM | Count the reserved pages in the DMA zone |
| 63 | pagetable_init | MM | Initialize sparse mem, and zone sizes |
| 64 | tboot_init | CPU | Intel TXT (Trusted eXecution Technology) support |

setup_arch (x86) (8)
| # | Function | Category | Description |
|---|---|---|---|
| 65 | map_vsyscall | CPU | Map vsyscall |
| 66 | generic_apic_probe | CPU | Probe APIC driver |
| 67 | early_quirks | PCI | Apply some quirks for certain devices |
| 68 | acpi_boot_init | ACPI | Parse (again) BOOT, FADT, MADT, HPET etc. |
| 69 | sfi_init | SFI | SFI (Simple Firmware Interface) |
| 70 | x86_dtb_init | Setup | Device tree |
| 71 | get_smp_config | SMP | (If ACPI is not found) construct the table |
| 72 | prefill_possible_map | SMP | Set the possible CPU map |
| 73 | init_cpu_to_node | NUMA | Set up the cpu to node map |
| 74 | init_apic_mappings | CPU | Set the local APIC address |
| 75 | x86_io_apic_ops.init | CPU | I/O APIC |
| 76 | kvm_guest_init | Virt. | KVM guest (paravirt ops, etc.) |
| 77 | e820_reserve_resources | MM | Reserve resources for e820 entries |

setup_arch (x86) (9)
| # | Function | Category | Description |
|---|---|---|---|
| 78 | e820_mark_nosave_regions | PM | Add non-RAM areas in e820 to nosave regions |
| 79 | x86_init.resources.reserve_resources | I/O | Reserve standard I/O resources (Timer, KB, ...) |
| 80 | e820_setup_gap | MM | Find the largest gap in e820, and let PCI use the gap to allocate new MMIO areas |
| 81 | x86_init.oem.banner | Debug | Prints "Booting paravirtualized kernel on %s" |
| 82 | x86_init.timers.wallclock_init | Timer | (NOP; defined in MID only) |
| 83 | mcheck_init | CPU | Machine check (temperature) |
| 84 | arch_init_ideal_nops | CPU | Set the NOP instructions ideal for the current platform |
| 85 | register_refined_jiffies | Timer | Register refined_jiffies clocksource |

setup_arch (ARM) (1)
| # | Function | Category | Description |
|---|---|---|---|
| 1 | setup_processor | CPU | Processor initialization |
| 2 | setup_machine_fdt | Setup | Parse the device tree |
| 3 | setup_machine_tags | Setup | If #2 fails, parse the ATAGs |
| 4 | (prepare init_mm) | MM | Set start_code, end_code, etc. for init_mm |
| 5 | (command line stuffs) | Setup | (=#15 in x86) |
| 6 | parse_early_param | Setup | (=#17 in x86) |
| 7 | (sort meminfo) | MM | Sort the memory information |
| 8 | early_paging_init | MM | Recreate the page table prepared during boot |
| 9 | setup_dma_zone | MM | Set up the DMA zone information |
| 10 | sanity_check_meminfo | MM | Sanitize the meminfo |
| 11 | arm_memblock_init | MM | Add free memory from meminfo, and reserve various reserved areas |
| 12 | paging_init | MM | Permanent kmap area |

setup_arch (ARM) (2)
| # | Function | Category | Description |
|---|---|---|---|
| 13 | request_standard_resources | MM | Reserve resources for system memory, video RAM |
| 14 | unflatten_device_tree | Setup | Create a tree from the FDT |
| 15 | arm_dt_init_cpu_maps | CPU | Create the CPU logical map based on the device tree |
| 16 | psci_init | CPU | Read the method to be used for CPU on, off, etc. |
| 17 | smp_init_cpus | SMP | Initialize the CPU cores available |
| 18 | smp_build_mpidr_hash | SMP | Precompute shifts required to get an index from an MPIDR (Multiprocessor Affinity Register) value |
| 19 | hyp_mode_check | Virt. | Check if the CPU is running in HYP mode |
| 20 | reserve_crashkernel | Debug | Reserve memory for crashkernel |
| 21 | mdesc->init_early | | (Platform-specific initialization) |

The rest of initialization
rest_init (init/main.c)
- Create two kernel threads
  - init (PID = 1; it gradually becomes the init user process)
  - kthreadd (PID = 2; it allows init to create other kernel threads)

    static noinline void __init_refok rest_init(void)
    {
        rcu_scheduler_starting();
        ...
        /* Spawn init first so that it obtains PID 1 */
        kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
        numa_default_policy();
        /* Then spawn kthreadd (PID 2) */
        pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
        rcu_read_lock();
        kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
        rcu_read_unlock();
        /* kernel_init waits on this completion */
        complete(&kthreadd_done);
        ...
        init_idle_bootup_task(current);
        schedule_preempt_disabled();
        ...
        cpu_startup_entry(CPUHP_ONLINE);
    }

Idle task
- Before entering idle, the boot task calls the scheduler.

- Then, it calls the idle function.

        ...
        init_idle_bootup_task(current);
        schedule_preempt_disabled();
        ...
        cpu_startup_entry(CPUHP_ONLINE);
    }

    void __sched schedule_preempt_disabled(void)
    {
        sched_preempt_enable_no_resched();
        schedule();
        preempt_disable();
    }

    void cpu_startup_entry(enum cpuhp_state state)
    {
        ...
        __current_set_polling();
        arch_cpu_idle_prepare();
        cpu_idle_loop();
    }

kernel_init
- Call the remaining init functions (kernel_init_freeable)
- Synchronize all the asynchronous operations
- Free the initmem (free_initmem)
- Mark read-only data as RO (and NX) (mark_rodata_ro)
- Set the system state to SYSTEM_RUNNING
- Set the current NUMA policy to default (numa_default_policy)
- Try to execve(2) the init process
  - If the rdinit parameter is set, exec the path
  - If the init parameter is set, exec the path
  - Try to run /sbin/init, /etc/init, /bin/init, /bin/sh
- If nothing worked, panic with a familiar message:

    "No working init found. Try passing init= option to kernel. See Linux Documentation/init.txt for guidance."

kernel_init_freeable
- First, wait for the completion of kthreadd's setup
- Set init's allowed cpus/mems to all CPUs and nodes
- Set cad_pid to init's PID
- Prepare to boot other CPUs (smp_prepare_cpus)
- Call early initcalls (do_pre_smp_initcalls)
- Initialize lockup_detector (lockup_detector_init)
- Initialize multiprocessor (smp_init)
  - Boots up other cores/sockets
- Initialize the scheduler (sched_init_smp)
- Call the do_basic_setup function (described next)
- Open /dev/console and dup it twice (fd: 0 to 2)
- Check if the ramdisk is available
  - If not, try to mount root (prepare_namespace)
- Load the I/O scheduler (elevator) module

do_basic_setup
- Re-initialize cpuset to the active CPUs (cpuset_init_smp)
- Initialize the user-mode helper (khelper)
- Initialize tmpfs (shmem_init)
- Initialize drivers (driver_init)
- Create proc directories and files for IRQs (init_irq_proc)
- Call constructors (do_ctors) (CONFIG_CONSTRUCTORS)
- Enable the user-mode helper workqueue
- Call all the initcalls (do_initcalls)
- Initialize random values (random_int_secret_init)

initcalls
- Facility to call initialization functions during the initialization (in the kernel_init_freeable function)
- Example

    static int cpu_pm_init(void)
    {
        register_syscore_ops(&cpu_pm_syscore_ops);
        return 0;
    }
    core_initcall(cpu_pm_init);
    (kernel/cpu_pm.c)

Level of initcalls
- Several levels (which determine the calling order) are defined

| Macro | Lv. # | Description |
|---|---|---|
| early_initcall | early | Called before SMP |
| pure_initcall | 0 | No dependency, variable initialization |
| core_initcall{,_sync} | 1, 1s | |
| postcore_initcall{,_sync} | 2, 2s | |
| arch_initcall{,_sync} | 3, 3s | |
| subsys_initcall{,_sync} | 4, 4s | |
| fs_initcall{,_sync} | 5, 5s | |
| rootfs_initcall | rootfs | |
| device_initcall{,_sync} | 6, 6s | |
| late_initcall{,_sync} | 7, 7s | |

Initcall definition
- Collect all the pointers to initcall functions in certain sections
- Section name: .initcall<lv>.init, where <lv> is the level
- E.g. for core_initcall, the section will be .initcall1.init

    #define __define_initcall(fn, id) \
        static initcall_t __initcall_##fn##id __used \
        __attribute__((__section__(".initcall" #id ".init"))) = fn; \
        LTO_REFERENCE_INITCALL(__initcall_##fn##id)
    (include/linux/init.h)

In the LD script

    #define INIT_CALLS \
        VMLINUX_SYMBOL(__initcall_start) = .; \
        *(.initcallearly.init) \
        INIT_CALLS_LEVEL(0) \
        INIT_CALLS_LEVEL(1) \
        INIT_CALLS_LEVEL(2) \
        INIT_CALLS_LEVEL(3) \
        INIT_CALLS_LEVEL(4) \
        INIT_CALLS_LEVEL(5) \
        INIT_CALLS_LEVEL(rootfs) \
        INIT_CALLS_LEVEL(6) \
        INIT_CALLS_LEVEL(7) \
        VMLINUX_SYMBOL(__initcall_end) = .;
    (include/asm-generic/vmlinux.lds.h)

    #define INIT_CALLS_LEVEL(level) \
        VMLINUX_SYMBOL(__initcall##level##_start) = .; \
        *(.initcall##level##.init) \
        *(.initcall##level##s.init) \
    (include/asm-generic/vmlinux.lds.h)

Special initcalls
- console_initcall: called from console_init (in start_kernel)
- security_initcall: called from security_init (in start_kernel)

When used in loadable modules (not recommended), they are replaced by module_init:

    #else /* MODULE */

    /* Don't use these in loadable modules, but some people do... */
    #define early_initcall(fn) module_init(fn)
    #define core_initcall(fn) module_init(fn)
    ...
    (include/linux/init.h)

Initcall debug
- Kernel command-line option: initcall_debug
- Shows debug messages
  - When each initcall function is called and when it returns, the kernel prints a message, including the elapsed time

    static int __init_or_module do_one_initcall_debug(initcall_t fn)
    {
        ...
        pr_debug("calling %pF @ %i\n", fn, task_pid_nr(current));
        calltime = ktime_get();
        ret = fn();
        rettime = ktime_get();
        ...
        pr_debug("initcall %pF returned %d after %lld usecs\n", fn, ret, duration);
        ...
    }
    (init/main.c)

4. Multiprocessor Initialization
Welcome to the world of concurrency!

How are the multiple cores started?
- Two types

[Figure: two timelines for Core 0, Core 1, and Core 2, from HW power-on through starting the Linux kernel and initializing SMP. In one, Core 0 wakes up Core 1 and Core 2. In the other, Core 1 and Core 2 stop and wait until Core 0 wakes them up.]

How are the multiple cores started?
- The first type: x86, ARM, etc.
  - (x86) The first processor (core) is determined by the HW, and is called the bootstrap processor (BSP). The remaining processors (cores) are called application processors (APs).
- The second type: PowerPC (some models), etc.

MP Detection
- How to detect the number of cores available in the hardware?
- Firmware information
  - ACPI MADT (Multiple APIC Description Table) (x86)
  - SFI (Simple Firmware Interface) (Xeon Phi)
  - MP Configuration Table (very old x86)
  - DeviceTree (ARM)
  - Or hardcoded (ARM)
- Kernel boot parameters
  - nosmp
  - maxcpus=
- Kernel configuration
  - CONFIG_NR_CPUS

MP Booting
- x86
  - INIT IPI
    - The sequence of INIT, INIT, STARTUP IPI
  - NMI (for CPU0)
    - This works only to wake up a soft-offlined CPU0
- ARM
  - enable-method node in the device tree
  - Depends on the board (march)
- ARM64
  - enable-method node in the device tree
  - spin-table
    - Cores spin at some memory area (outside the kernel). When a value is written to the area, the core jumps to the written address.
  - psci (Power State Coordination Interface)

AP Initialization
- After being woken up, where does an AP start executing?
- x86
  - First, the trampoline code
    - Switches from real mode to 32-bit or 64-bit mode
    - Located in very low memory, since the new core starts in real mode
  - Then, it jumps to the secondary entry point
    - 32-bit: startup_32_smp (arch/x86/kernel/head_32.S)
    - 64-bit: secondary_startup_64 (arch/x86/kernel/head_64.S)
- ARM64
  - First, secondary_holding_pen (arch/arm64/kernel/head.S)
    - After being woken up, all the cores are held in this function
  - Then, secondary_startup

AP Initialization (2)
- Initializes the CPU state for the new core at the assembler level
  - Paging on
  - Some special registers
- Then, goes to the C code
  - start_secondary (in x86, arch/x86/kernel/smpboot.c)
  - secondary_start_kernel (in ARM/ARM64, arch/arm{,64}/kernel/smp.c)
- Finally, it goes to the idle loop as the boot task does
  - cpu_startup_entry

start_secondary (x86)
| # | Function | Category | Description |
|---|---|---|---|
| 1 | cpu_init | CPU | Various CPU states |
| 2 | x86_cpuinit.early_percpu_clock_init | | |
| 3 | smp_callin | SMP | Notify the BSP of the AP's boot-up |
| 4 | check_tsc_sync_target | | |
| 5 | set_cpu_online | SMP | Set the cpu_online_mask |
| 6 | x86_platform.nmi_init | CPU | |
| 7 | boot_init_stack_canary | Debug | |
| 8 | x86_cpuinit.setup_percpu_clockev | | |
| 9 | cpu_startup_entry | | |

secondary_start_kernel (ARM64)
| # | Function | Category | Description |
|---|---|---|---|
| 1 | (Set the current mm to init_mm) | MM | |
| 2 | set_my_cpu_offset | SMP | Set per-cpu offset |
| 3 | cpu_set_reserved_ttbr0 | CPU | Set TTBR0 to the zero page |
| 4 | cpu_ops[cpu]->cpu_postboot | CPU | |
| 5 | notify_cpu_starting | | |
| 6 | smp_store_cpu_info | | |
| 7 | set_cpu_online | | |
| 8 | complete | | Notify the boot CPU of the core's boot |
| 9 | cpu_startup_entry | | Go to the idle loop |

(Notes)
- Naming conventions
  - BP? BSP?
  - Why do some functions have e820_ as their prefix while others do not?
