Operating System Design - Linux
Instructor: Ching-Chi Hsu
TA: Yung-Yu Chuang
Introduction to Linux (Nov. 1991, Linus Torvalds)
• Multi-tasking
• Demand loading & Copy On Write
• Paging (not swapping)
• Shared Libraries
• POSIX 1003.1
• Protected Mode
• Support different file systems and executable formats
Multitasking
[Figure: timeline of several tasks alternating between "require service" and "CPU idle"; with time-sharing, a timer interrupt preempts the running task when its time slice expires.]
• Based on i386 and Linux 2.0.33
• Topics
  – initialization
  – memory management (free space management, virtual memory management)
  – process management (context switching, scheduling)
  – system call
Resources for Tracing Linux
• http://odie.csie.ntu.edu.tw/~osd
• TLK, KHG, Linux Kernel Internals
• Source code browser
• Intel Programmer’s manual
Source Tree for Linux
/usr/src/linux
  modules
  fs
    nfs
    ext2
    proc
    ….
  net
  kernel
  init
  include
    linux
    asm-i386
    asm-????
  ipc
  lib
  drivers
    char
    block
    scsi
    net
  arch
    i386
      kernel
      boot
      mm
    ????
How to compile Linux Kernel
1. make config (or make menuconfig)
2. make depend
3. one of:
   make boot (generate a compressed bootable Linux kernel, arch/i386/boot/zImage)
   make zdisk (generate the kernel and write it to a floppy: dd if=zImage of=/dev/fd0)
   make zlilo (generate the kernel and copy it to /vmlinuz)
lilo: Linux Loader
i386
• Segmented Addressing (segment:offset)
• Paging(Virtual Memory)
• Call Gate (Protection)
• TSS (Context Switching)
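[Figure: segment translation. The SELECTOR (INDEX, TI) picks a descriptor from the GDT (located by GDTR) or the LDT (located by LDTR). The descriptor's BASE is added to the OFFSET to give the linear address, which must lie between BASE and BASE+LIMIT.]
Descriptor format (Desc., Call gate, TSS):
  bits 0-31:   BASE 15:0 | LIMIT 15:0
  bits 32-63:  BASE 31:24 | G | D | 0 | AVL | LIMIT 19:16 | P | DPL | S | TYPE | BASE 23:16
[Figure: paging. CR3 points to the page directory. The linear address is split into directory index (ddd), table index (ttt) and offset (ooo). The PDE selects a page table, the PTE gives the page address, and page address + offset is the physical address of a byte in a 4K page. The P bit tells whether the page is in physical memory or on disk.]
[Figure: 4GB linear address space with the OS at the top; privilege levels 0 to 3.]
Call Gate
[Figure: a call through a TSS gate or a TSS descriptor (in GDT) causes context switching. The TSS holds CS, DS, ES, …, IP, SP0, SP1, SP2, SP3, CR3, …..]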
i386 Initialization
• #RESET
  – real-address mode
  – self-test
  – EAX contains error code
  – EDX contains CPU id
  – CR0 (bits shown: PG, PE, TS, EM, MP; the remaining bits are reserved, 0)
Register State
  EFLAGS        0XXXX0002H
  EIP           0000FFF0H
  CS*           0F000H
  DS**          0000H
  SS            0000H
  ES**          0000H
  FS            0000H
  GS            0000H
  IDTR(base)    00000000H
  IDTR(limit)   03FFH
  DR7           0000H
* invisible part: 0FFFF0000 (base), 0FFFF (limit)
** invisible part: 0 (base), 0FFFF (limit)
FFFF0H: ROM-BIOS address
* do some test
* initialize interrupt vector at physical address 0
* load the first sector of a bootable device to 0x7C00 (boot/bootsect.S)
* jump to 0x7C00 and run
Linux Kernel on Disk (vmlinux, 1,133,665 bytes)
[Figure: layout of /usr/src/linux/arch/i386/boot/zImage on the boot disk: bootsect.S (1 sector), Setup.S (4 sectors), then the self-extracted kernel image, made of the decompression module and the compressed kernel image (vmlinux.out, 455,321 bytes) built from the vmlinux executable.]
[Figure: memory below 1M during boot (A20 line; I/O & BIOS from A0000 up), shown in four steps:
  1. BIOS loads bootsect.S (0.5K bytes) at 0x7C00 and sets IP there.
  2. bootsect.S copies itself (0.5K bytes) to 90000 and continues there.
  3. bootsect.S loads Setup.S (2K bytes) at 90200.
  4. bootsect.S loads vmlinux (508K bytes) at 10000 (64K).]
SETUPSECS = 4                   ! nr of setup-sectors
BOOTSEG   = 0x07C0              ! original address of boot-sector
INITSEG   = DEF_INITSEG         ! we move boot here - out of the way (0x9000)
SETUPSEG  = DEF_SETUPSEG        ! setup starts here (0x9020)
SYSSEG    = DEF_SYSSEG          ! system loaded at 0x10000 (65536)
<omitted>
Copy bootsect.S to 0x90000
        mov     ax,#BOOTSEG
        mov     ds,ax
        mov     ax,#INITSEG
        mov     es,ax
        mov     cx,#256
        sub     si,si
        sub     di,di
        cld
        rep
        movsw
        jmpi    go,INITSEG      ! Execute moved bootsect
go:
<omit>
Try to load setup.S from (drive 0, head 0, sector 2, track 0) to memory 0x90200
load_setup:
        xor     dx, dx          ! drive 0, head 0
        mov     cl,#0x02        ! sector 2, track 0
        mov     bx,#0x0200      ! address = 512, in INITSEG
        mov     ah,#0x02        ! service 2, nr of sectors
        mov     al,setup_sects  ! (assume all on head 0, track 0)
                                ! setup_sects=4
        int     0x13            ! read it (BIOS routine)
        jnc     ok_load_setup   ! ok - continue
        push    ax              ! dump error code
        call    print_nl
        mov     bp, sp
        call    print_hex
        pop     ax
        jmp     load_setup
ok_load_setup:
<omit>
Print "\nLoading"
! Print some inane message
        mov     ah,#0x03        ! read cursor pos
        xor     bh,bh
        int     0x10
        mov     cx,#9
        mov     bx,#0x0007      ! page 0, attribute 7 (normal)
        mov     bp,#msg1        ! .byte 13,10 .ascii "Loading"
        mov     ax,#0x1301      ! write string, move cursor
        int     0x10            ! BIOS routine
! ok, we've written the message, now
! we want to load the system (at 0x10000)
        mov     ax,#SYSSEG
        mov     es,ax           ! segment of 0x010000
        call    read_it         ! Read 508K to 0x10000 (64K), one . per track
        call    kill_motor      ! Stop floppy motor
        call    print_nl
<omit>
        jmpi    0, SETUPSEG     ! Jump to 0x90200 (setup.S)
setup.S
• Check memory size
• set keyboard, video adapter, get HD data
• switch to protected mode
  – set GDT
  – set IDT
  – set PE bit (flush pipe)
start:
        jmp     start_of_setup
! ------------------------ start of header --------------------------------
!
! SETUP-header, must start at CS:2 (old 0x9020:2)
!
        .ascii  "HdrS"          ! Signature for SETUP-header
        .word   0x0201          ! Version number of header format
                                ! (must be >= 0x0105
                                ! else old loadlin-1.5 will fail)
<omit>
start_of_setup:
        …………… (check signature)
good_sig:
        mov     ax,cs           ! aka #SETUPSEG
        sub     ax,#DELTA_INITSEG ! aka #INITSEG
        mov     ds,ax           ! DS=9000
loader_ok:
! Get memory size (extended mem, kB)
        mov     ah,#0x88
        int     0x15
        mov     [2],ax          ! Store memory size in 0x90002 (bootsect.S)
<omit>
(disable interrupts)
(move kernel image to 0x1000)
end_move_self:
        lidt    idt_48          ! load idt with 0,0
        lgdt    gdt_48          ! load gdt with whatever appropriate
idt_48:
        .word   0
        .word   0, 0
gdt_48:
        .word   0x800
        .word   512+gdt, 0x9

          BASE             LIMIT
idt_48    0,0              0
gdt_48    0x9, 512+gdt     0x800 (2048)

gdt:
        .word   0,0,0,0         ! dummy
        .word   0,0,0,0         ! unused

        .word   0xFFFF          ! 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0x0000          ! base address=0
        .word   0x9A00          ! code read/exec
        .word   0x00CF          ! granularity=4096, 386 (+5th nibble of limit)

        .word   0xFFFF          ! 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0x0000          ! base address=0
        .word   0x9200          ! data read/write
        .word   0x00CF          ! granularity=4096, 386 (+5th nibble of limit)
Descriptor format:
  bits 0-31:   BASE 15:0 | LIMIT 15:0
  bits 32-63:  BASE 31:24 | G | D | 0 | AVL | LIMIT 19:16 | P | DPL | S | TYPE | BASE 23:16
GDT entries: null, not used, code, data
code: BASE=0x00000000, LIMIT=FFFFF, G=1 (4G), DPL=0, type=1010 (code, non-conforming, r/x, not accessed)
data: BASE=0x00000000, LIMIT=FFFFF, G=1 (4G), DPL=0, type=0010 (data, expand-up, r/w, not accessed)
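The descriptor words above can be checked by unpacking them by hand. The following is a minimal sketch, not kernel code: the names seg_desc, decode_desc and seg_size are made up for illustration, and the four .word values of an entry are assumed to be combined little-endian into one 64-bit value (0xFFFF, 0x0000, 0x9A00, 0x00CF becomes 0x00CF9A000000FFFF).

```c
#include <assert.h>
#include <stdint.h>

/* Unpacked view of one 8-byte i386 segment descriptor (illustrative type). */
struct seg_desc {
    uint32_t base;    /* 32-bit segment base address */
    uint32_t limit;   /* raw 20-bit limit field */
    unsigned type;    /* 4-bit TYPE field */
    unsigned s;       /* S bit: 1 = code/data segment, 0 = system */
    unsigned dpl;     /* descriptor privilege level, 0..3 */
    unsigned present; /* P bit */
    unsigned gran;    /* G bit: 1 = limit counted in 4K units */
};

/* Pull the scattered BASE and LIMIT pieces back together. */
static struct seg_desc decode_desc(uint64_t d)
{
    struct seg_desc r;
    r.limit   = (uint32_t)(d & 0xFFFF) | (uint32_t)((d >> 32) & 0xF0000);
    r.base    = (uint32_t)((d >> 16) & 0xFFFFFF) |
                (uint32_t)((d >> 32) & 0xFF000000u);
    r.type    = (unsigned)((d >> 40) & 0xF);
    r.s       = (unsigned)((d >> 44) & 1);
    r.dpl     = (unsigned)((d >> 45) & 3);
    r.present = (unsigned)((d >> 47) & 1);
    r.gran    = (unsigned)((d >> 55) & 1);
    return r;
}

/* Segment size in bytes: (limit + 1) units of 4K or of 1 byte. */
static uint64_t seg_size(struct seg_desc r)
{
    assert(r.limit <= 0xFFFFF);   /* the limit field is only 20 bits wide */
    uint64_t units = (uint64_t)r.limit + 1;
    return r.gran ? units << 12 : units;
}
```

For the code entry of setup.S, decode_desc(0x00CF9A000000FFFF) gives base 0, limit 0xFFFFF, type 1010, DPL 0 and G=1, so seg_size reports 4 GB.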
! that was painless, now we enable A20 (no address wrap-around)
        call    empty_8042
        mov     al,#0xD1        ! command write
        out     #0x64,al
        call    empty_8042
        mov     al,#0xDF        ! A20 on
        out     #0x60,al
        call    empty_8042
<omit>
        mov     ax,#1           ! protected mode (PE) bit
        lmsw    ax              ! This is it! Load into CR0
        jmp     flush_instr     ! Flush pipe
flush_instr:
        xor     bx,bx           ! Flag to indicate a boot
! NOTE: For high loaded big kernels we need a
!       jmpi    0x100000,KERNEL_CS
!
! but we yet haven't reloaded the CS register, so the default size
! of the target offset still is 16 bit.
! However, using an operand prefix (0x66), the CPU will properly
! take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
! Manual, Mixing 16-bit and 32-bit code, page 16-6)
        db      0x66,0xea       ! prefix + jmpi-opcode
code32: dd      0x1000          ! will be set to 0x100000 for big kernels
        dw      KERNEL_CS       ! KERNEL_CS=0x10

Selector format (KERNEL_CS = 0x10 = 0000 0000 0001 0000):
  bits 15-3: INDEX | bit 2: TI (0: GDT, 1: LDT) | bits 1-0: RPL
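A selector can be taken apart with two shifts and a mask. This is a small sketch of the bit layout shown above; the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a 16-bit segment selector (illustrative type). */
struct selector {
    unsigned index; /* bits 15..3: entry number in the table */
    unsigned ti;    /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1..0: requested privilege level */
};

static struct selector decode_selector(uint16_t sel)
{
    struct selector s;
    s.index = sel >> 3;
    s.ti    = (sel >> 2) & 1;
    s.rpl   = sel & 3;
    assert(s.index < 8192);   /* 13-bit index */
    return s;
}
```

KERNEL_CS = 0x10 therefore decodes to index 2 in the GDT with RPL 0, which is the code entry that follows the null and unused descriptors.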
Decompress Kernel
startup_32:                             # (gcc entry point)
        cld
        cli
        movl    $(KERNEL_DS),%eax       # KERNEL_DS=0x18
        mov     %ax,%ds
        mov     %ax,%es
        mov     %ax,%fs
        mov     %ax,%gs
<omit>
        lss     SYMBOL_NAME(stack_start),%esp
        xorl    %eax,%eax
1:      incl    %eax                    # check that A20 really IS enabled
        movl    %eax,0x000000           # loop forever if it isn't
        cmpl    %eax,0x100000
        je      1b
( clear BSS )
/*
 * Do the decompression, and jump to the new kernel..
 */
        subl    $16,%esp                # place for structure on the stack
        pushl   %esp                    # address of structure as first arg
        call    SYMBOL_NAME(decompress_kernel)  # decompress kernel to 100000
        orl     %eax,%eax               # gunzip 1.0.3
        jnz     3f
        xorl    %ebx,%ebx
        ljmp    $(KERNEL_CS), $0x100000 # jump to decompressed kernel
head.S
(copy parameters from 0x90000)
Memory layout of the decompressed kernel (EIP starts at 100000):
  100000  start of head.S
  101000  swapper_pg_dir (PG_DIR)
  102000  pg0
  103000  empty_bad_page
  104000  empty_bad_page_table
  105000  empty_zero_page
  106000  stack, idt, gdt
Setup Paging Table & Enable Paging
[Figure: CR3 points to PG_DIR; entries 0 and 768 of PG_DIR both point to PG0, which maps the first 4M of physical memory.]
Setup GDT
[Figure: GDTR points to gdt, which holds:]
  selector  entry
  -         NULL
  -         not used (0)
  0x10      base C0000000, 1G, DPL=0, code
  0x18      base C0000000, 1G, DPL=0, data
  0x23      base 00000000, 3G, DPL=3, code
  0x2b      base 00000000, 3G, DPL=3, data
  -         not used (0)
  -         not used (0)
  -         2*NR_TASKS further entries
Setup IDT
[Figure: IDTR points to idt; all entries 0 to 255 point to ignore_int.]
        call    setup_paging

setup_paging:
        movl    $1024*2,%ecx    /* 2 pages - swapper_pg_dir+1 page table */
        xorl    %eax,%eax
        movl    $ SYMBOL_NAME(swapper_pg_dir),%edi  /* swapper_pg_dir is at 0x1000 */
        cld;rep;stosl
/* Identity-map the kernel in low 4MB memory for ease of transition */
/* set present bit/user r/w */
        movl    $ SYMBOL_NAME(pg0)+7,SYMBOL_NAME(swapper_pg_dir)
/* But the real place is at 0xC0000000 */
/* set present bit/user r/w */
        movl    $ SYMBOL_NAME(pg0)+7,SYMBOL_NAME(swapper_pg_dir)+3072
        movl    $ SYMBOL_NAME(pg0)+4092,%edi
        movl    $0x03ff007,%eax /* 4Mb - 4096 + 7 (r/w user,p) */
        std
1:      stosl                   /* fill the page backwards - more efficient :-) */
        subl    $0x1000,%eax
        jge     1b
        cld
        movl    $ SYMBOL_NAME(swapper_pg_dir),%eax
        movl    %eax,%cr3       /* cr3 - page directory start */
        movl    %cr0,%eax
        orl     $0x80000000,%eax
        movl    %eax,%cr0       /* set paging (PG) bit */
        ret                     /* this also flushes the prefetch-queue */

Format of PDE & PTE
  bits 31-12: Page Address | bit 6: D | bit 5: A | bit 2: U/S | bit 1: R/W | bit 0: P
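The constants in setup_paging follow directly from the 10/10/12 split of a linear address and from the entry format. A minimal sketch, with helper names (pgd_index, pte_index, page_offset, make_entry) and flag macros chosen only for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Low bits of a PDE/PTE, as in the format shown above (names are ours). */
#define ENTRY_P   0x1u  /* present */
#define ENTRY_RW  0x2u  /* writable */
#define ENTRY_US  0x4u  /* accessible from user mode */

/* Top 10 bits of a linear address: index into the page directory. */
static unsigned pgd_index(uint32_t addr)   { return addr >> 22; }

/* Middle 10 bits: index into the page table. */
static unsigned pte_index(uint32_t addr)   { return (addr >> 12) & 0x3FF; }

/* Low 12 bits: byte offset inside the 4K page. */
static unsigned page_offset(uint32_t addr) { return addr & 0xFFF; }

/* Combine a page-aligned physical address with flag bits. */
static uint32_t make_entry(uint32_t page_addr, unsigned flags)
{
    assert((page_addr & 0xFFF) == 0);  /* page address must be 4K aligned */
    assert(flags <= 0xFFF);
    return page_addr | flags;
}
```

pgd_index(0xC0000000) is 768, and 768 entries of 4 bytes each is the +3072 offset used for the second swapper_pg_dir entry. The last page table entry written by the loop, 0x03ff007, is make_entry(0x3FF000, ENTRY_P | ENTRY_RW | ENTRY_US).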
        lgdt    gdt_descr

gdt_descr:
        .word   (8+2*NR_TASKS)*8-1
        .long   0xc0000000+SYMBOL_NAME(gdt)

ENTRY(gdt)
        .quad   0x0000000000000000      /* NULL descriptor */
        .quad   0x0000000000000000      /* not used */
        .quad   0xc0c39a000000ffff      /* 0x10 kernel 1GB code at 0xC0000000 */
        .quad   0xc0c392000000ffff      /* 0x18 kernel 1GB data at 0xC0000000 */
        .quad   0x00cbfa000000ffff      /* 0x23 user   3GB code at 0x00000000 */
        .quad   0x00cbf2000000ffff      /* 0x2b user   3GB data at 0x00000000 */
        .quad   0x0000000000000000      /* not used */
        .quad   0x0000000000000000      /* not used */
        .fill   2*NR_TASKS,8,0          /* space for LDT's and TSS's etc */

(setup data segments and clear BSS)
        call    setup_idt

setup_idt:
        lea     ignore_int,%edx
        movl    $(KERNEL_CS << 16),%eax
        movw    %dx,%ax                 /* selector = 0x0010 = cs */
        movw    $0x8E00,%dx             /* interrupt gate - dpl=0, present */
        lea     SYMBOL_NAME(idt),%edi
        mov     $256,%ecx
rp_sidt:
        movl    %eax,(%edi)
        movl    %edx,4(%edi)
        addl    $8,%edi
        dec     %ecx
        jne     rp_sidt
        ret

Interrupt gate written by setup_idt:
  first word:   SELECTOR | OFFSET (low 16 bits)
  second word:  OFFSET (high 16 bits) | 8E00
ignore_int: just print "Unknown Interrupt"

        lidt    idt_descr
        ljmp    $(KERNEL_CS),$1f
1:      movl    $(KERNEL_DS),%eax       # reload all the segment registers
        mov     %ax,%ds                 # after changing gdt.
        mov     %ax,%es
        mov     %ax,%fs
        mov     %ax,%gs
        call    SYMBOL_NAME(start_kernel)       # jump to C main routine
start_kernel
asmlinkage void start_kernel(void)
{
        setup_arch(&command_line, &memory_start, &memory_end);
        memory_start = paging_init(memory_start,memory_end);
        trap_init();
        init_IRQ();
        <-------------- omit ---------------->
        memory_start = console_init(memory_start,memory_end);
        memory_start = kmalloc_init(memory_start,memory_end);
        sti();                          /* enable interrupt */
        memory_start = inode_init(memory_start,memory_end);
        memory_start = file_table_init(memory_start,memory_end);
        memory_start = name_cache_init(memory_start,memory_end);
        mem_init(memory_start,memory_end);
        <---------- omit ------------->
        printk(linux_banner);
        sysctl_init();
        kernel_thread(init, NULL, 0);
        cpu_idle(NULL);
}
setup_arch
[Figure: physical memory with the kernel loaded at 1M; memory_start is the first address after the kernel image and memory_end is the end of physical memory.]
        memory_start = (unsigned long) &_end;
        memory_end = (1<<20) + (EXT_MEM_K<<10);
        memory_end &= PAGE_MASK;

#define PARAM empty_zero_page
#define EXT_MEM_K (*(unsigned short *) (PARAM+2))

        init_task.mm->start_code = TASK_SIZE;   /* 0xC0000000 */
        init_task.mm->end_code = TASK_SIZE + (unsigned long) &_etext;
        init_task.mm->end_data = TASK_SIZE + (unsigned long) &_edata;
        init_task.mm->brk = TASK_SIZE + (unsigned long) &_end;

        /* "mem=XXX[kKmM]" overrides the BIOS-reported memory size */
        if (c == ' ' && *(const unsigned long *)from == *(const unsigned long *)"mem=")
                memory_end = simple_strtoul(from+4, &from, 0);
        if ( *from == 'K' || *from == 'k' ) {
                memory_end = memory_end << 10;
                from++;
        } else if ( *from == 'M' || *from == 'm' ) {
                memory_end = memory_end << 20;
                from++;
        }
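The "mem=" handling can be tried in user space. This is a sketch under stated assumptions: it uses the C library's strtoul in place of the kernel's simple_strtoul, and parse_mem_option with its fallback argument is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one boot argument of the form "mem=XXX[kKmM]".
 * Returns the size in bytes, or `fallback` when the argument is not "mem=".
 */
static unsigned long parse_mem_option(const char *arg, unsigned long fallback)
{
    char *end;
    unsigned long size;

    assert(arg != NULL);
    if (strncmp(arg, "mem=", 4) != 0)
        return fallback;

    size = strtoul(arg + 4, &end, 0);   /* base 0: decimal, octal or hex */
    if (*end == 'K' || *end == 'k')
        size <<= 10;                    /* kilobytes to bytes */
    else if (*end == 'M' || *end == 'm')
        size <<= 20;                    /* megabytes to bytes */
    return size;
}
```

With this sketch, "mem=16M" and "mem=0x1000000" both yield 16777216 bytes, while an unrelated argument leaves the BIOS-reported value (passed as fallback) in place.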
paging_init
[Figure: the kernel sits at 1M, followed by pg_dir and pg0; the page tables pg1, pg2, … pgn are allocated from memory_start. Page directory entries 0, 1, … n and 768, 769, … point to pg0, pg1, pg2, … pgn, each table mapping 4M.]
        start_mem = PAGE_ALIGN(start_mem);
        address = 0;
        pg_dir = swapper_pg_dir;
        while (address < end_mem) {
                /* map the memory at virtual addr 0xC0000000 */
                pg_table = (pte_t *) (PAGE_MASK & pgd_val(pg_dir[768]));
                if (!pg_table) {
                        pg_table = (pte_t *) start_mem;
                        start_mem += PAGE_SIZE;
                }
                /* also map it temporarily at 0x0000000 for init */
                pgd_val(pg_dir[0]) = _PAGE_TABLE | (unsigned long) pg_table;
                pgd_val(pg_dir[768]) = _PAGE_TABLE | (unsigned long) pg_table;
                pg_dir++;
                for (tmp = 0 ; tmp < PTRS_PER_PTE ; tmp++,pg_table++) {
                        if (address < end_mem)
                                set_pte(pg_table, mk_pte(address, PAGE_SHARED));
                        else
                                pte_clear(pg_table);
                        address += PAGE_SIZE;
                }
        }
        local_flush_tlb();      /* move cr3, r?; mov r?, cr3; */
        return free_area_init(start_mem, end_mem);
free_area_init
1. Set min_free_pages
2. Initialize swap cache
3. Mark all pages reserved
4. Initialize Buddy system for free memory management
Free Memory Management (Tanenbaum)
• Bitmap (one bit per allocation unit, units 0 to 15):
    0011000011100100
• Linked list (first-fit, next-fit, best-fit, quick-fit), each entry is (P or H, start, length):
    P 0 2 → H 2 2 → P 4 4 → H 8 3 → P 11 2 → H 13 1 → P 14 2
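A first-fit search over the bitmap is a linear scan for a long enough run of free units. The sketch below assumes, as the hole entries of the linked list suggest, that a 1 bit in this example marks a free unit; the bitmap is held as a string of '0' and '1' characters purely for readability, and first_fit is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the first run of at least n free units ('1')
 * in the bitmap string, or -1 if no such run exists.
 */
static int first_fit(const char *bitmap, int n)
{
    int run = 0;    /* length of the current run of free units */

    assert(bitmap != NULL && n >= 1);
    for (int i = 0; bitmap[i] != '\0'; i++) {
        run = (bitmap[i] == '1') ? run + 1 : 0;
        if (run == n)
            return i - n + 1;   /* start of the run that just reached n */
    }
    return -1;
}
```

On the bitmap above, a request for 2 units is placed at unit 2 and a request for 3 units at unit 8, matching the holes H 2 2 and H 8 3 in the list; a request for 4 units fails.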
Buddy System
[Figure: a 16-page region (pages 0 to 16 on the axis) shown after each step of the sequence: Initialization; request A (2); request B (1); request C (2); free A*; request D (1); free B; free D; free C.]
[Figures: the free_area lists (orders 0 to 3), their bitmaps, and the mem_map array (pages 0 to 15) for the same example, shown at the steps Request D (1), Free B, Free D, Free C and Request 2.]
[Figure: memory after the kernel image, in order: Kernel, pg1-pgn, swap cache (4 bytes per page), mem_map, free_area[].bitmap; start_mem points past the last of these.]
typedef struct page {
        /* these must be first (free area handling) */
        struct page *next;
        struct page *prev;
        struct inode *inode;
        unsigned long offset;
        ………..
        atomic_t count;
        unsigned flags;
        unsigned dirty:16,
                 age:8;
        ……...
        unsigned long map_nr;   /* page->map_nr == page - mem_map */
} mem_map_t;

unsigned long free_area_init(unsigned long start_mem, unsigned long end_mem)
{
        /*
         * select nr of pages we try to keep free for important stuff
         * with a minimum of 48 pages. This is totally arbitrary
         */
        i = (end_mem - PAGE_OFFSET) >> (PAGE_SHIFT+7);
        if (i < 24)
                i = 24;
        i += 24;        /* The limit for buffer pages in __get_free_pages is
                         * decreased by 12+(i>>3) */
        min_free_pages = i;
        start_mem = init_swap_cache(start_mem, end_mem);
        mem_map = (mem_map_t *) start_mem;
        p = mem_map + MAP_NR(end_mem);
        start_mem = LONG_ALIGN((unsigned long) p);
        memset(mem_map, 0, start_mem - (unsigned long) mem_map);
        do {
                --p;
                p->flags = (1 << PG_DMA) | (1 << PG_reserved);
                p->map_nr = p - mem_map;
        } while (p > mem_map);
        for (i = 0 ; i < NR_MEM_LISTS ; i++) {          /* 6 */
                unsigned long bitmap_size;
                init_mem_queue(free_area+i);
                mask += mask;                           /* mask *=2 */
                end_mem = (end_mem + ~mask) & mask;     /* should be i+1 */
                bitmap_size = (end_mem - PAGE_OFFSET) >> (PAGE_SHIFT + i);
                bitmap_size = (bitmap_size + 7) >> 3;
                bitmap_size = LONG_ALIGN(bitmap_size);
                free_area[i].map = (unsigned int *) start_mem;
                memset((void *) start_mem, 0, bitmap_size);
                start_mem += bitmap_size;
        }
        return start_mem;
}
trap_init
1. Setup interrupt routines
2. Int 0x80 for system call
3. Setup TSS and LDT in GDT for each task
486 Exceptions
  0       Fault     Divided by Zero
  1       Fault     Debug
  …..
  0B      Fault     Not Present
  …..
  0D      Fault     General Protection
  0E      Fault     Page Fault
  …..
  20-FF   Int/Trap  Used for OS
void trap_init(void)
{
        set_call_gate(&default_ldt,lcall7);
        set_trap_gate(0,&divide_error);
        set_trap_gate(1,&debug);
        set_trap_gate(2,&nmi);
        set_system_gate(3,&int3);       /* int3-5 can be called from all */
        set_system_gate(4,&overflow);
        set_system_gate(5,&bounds);
        set_trap_gate(6,&invalid_op);
        set_trap_gate(7,&device_not_available);
        set_trap_gate(8,&double_fault);
        set_trap_gate(9,&coprocessor_segment_overrun);
        set_trap_gate(10,&invalid_TSS);
        set_trap_gate(11,&segment_not_present);
        set_trap_gate(12,&stack_segment);
        set_trap_gate(13,&general_protection);
        set_trap_gate(14,&page_fault);
        set_trap_gate(15,&spurious_interrupt_bug);
        set_trap_gate(16,&coprocessor_error);
        set_trap_gate(17,&alignment_check);
        for (i=18;i<48;i++)
                set_trap_gate(i,&reserved);
        set_system_gate(0x80,&system_call);
        /* set up GDT task & ldt entries */
        p = gdt+FIRST_TSS_ENTRY;
        set_tss_desc(p, &init_task.tss);        /* init_task: hardwired task #0 */
        p++;
        set_ldt_desc(p, &default_ldt, 1);
        p++;
        for(i=1 ; i<NR_TASKS ; i++) {
                p->a=p->b=0;
                p++;
                p->a=p->b=0;
                p++;
        }

  set_call_gate(a, addr)     =  _set_gate(a, 12, 3, addr)
  set_trap_gate(n, addr)     =  _set_gate(&idt[n], 15, 0, addr)
  set_system_gate(n, addr)   =  _set_gate(&idt[n], 15, 3, addr)
  set_intr_gate(n, addr)     =  _set_gate(&idt[n], 14, 0, addr)

#define _set_gate(gate_addr,type,dpl,addr) \
__asm__ __volatile__ ("movw %%dx,%%ax\n\t" \
        "movw %2,%%dx\n\t" \
        "movl %%eax,%0\n\t" \
        "movl %%edx,%1" \
        :"=m" (*((long *) (gate_addr))), \
         "=m" (*(1+(long *) (gate_addr))) \
        :"i" ((short) (0x8000+(dpl<<13)+(type<<8))), \
         "d" ((char *) (addr)),"a" (KERNEL_CS << 16) \
        :"ax","dx")

Descriptor in IDT
  bits 0-31:   SEGMENT SELECTOR | OFFSET 15:0
  bits 32-63:  OFFSET 31:16 | P | DPL | TYPE | 000 | RESERVED
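The two 32-bit words that _set_gate stores can be computed in plain C. The sketch below mirrors the arithmetic of the macro (0x8000 + (dpl<<13) + (type<<8) in the upper word) but is not the kernel's code; struct gate and make_gate are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* One 8-byte IDT entry as two 32-bit words (illustrative type). */
struct gate {
    uint32_t lo;  /* selector in bits 31..16, handler offset 15:0 below */
    uint32_t hi;  /* handler offset 31:16 above, P/DPL/TYPE in bits 15..8 */
};

static struct gate make_gate(uint16_t selector, uint32_t handler,
                             unsigned type, unsigned dpl)
{
    struct gate g;

    assert(type < 16 && dpl < 4);
    g.lo = ((uint32_t)selector << 16) | (handler & 0xFFFFu);
    g.hi = (handler & 0xFFFF0000u) | 0x8000u | (dpl << 13) | (type << 8);
    return g;
}
```

An interrupt gate (type 14, DPL 0) gets 0x8E00 in the low half of the second word, the constant loaded by setup_idt; a system gate (type 15, DPL 3), as used for int 0x80, gets 0xEF00.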
mem_init
• Reserve kernel and I/O pages
• Return all unused pages to buddy system
[Figure: physical memory map. The low region from start_low_mem (4K) up to 0xA0000 is usable; 0xA0000 to 0x100000 is reserved; the kernel code (up to end_text) and data start at 0x100000, followed by pg1-pgn, swap_cache, mem_map, free_area[].map and the Console, PCI & FS allocations; pages from start_mem to high_mem are free.]
void mem_init(unsigned long start_mem, unsigned long end_mem)
{
        end_mem &= PAGE_MASK;
        high_memory = end_mem;
        /* mark usable pages in the mem_map[] */
        start_low_mem = PAGE_ALIGN(start_low_mem);
        start_mem = PAGE_ALIGN(start_mem);
        /*
         * IBM messed up *AGAIN* in their thinkpad: 0xA0000 -> 0x9F000.
         * They seem to have done something stupid with the floppy
         * controller as well..
         */
        while (start_low_mem < 0x9f000) {
                clear_bit(PG_reserved, &mem_map[MAP_NR(start_low_mem)].flags);
                start_low_mem += PAGE_SIZE;
        }
        while (start_mem < high_memory) {
                clear_bit(PG_reserved, &mem_map[MAP_NR(start_mem)].flags);
                start_mem += PAGE_SIZE;
        }
        for (tmp = 0 ; tmp < high_memory ; tmp += PAGE_SIZE) {
                if (tmp >= MAX_DMA_ADDRESS)     /* 16M */
                        clear_bit(PG_DMA, &mem_map[MAP_NR(tmp)].flags);
                if (PageReserved(mem_map+MAP_NR(tmp))) {
                        if (tmp >= 0xA0000 && tmp < 0x100000)
                                reservedpages++;
                        else if (tmp < (unsigned long) &_etext)
                                codepages++;
                        else
                                datapages++;
                        continue;
                }
                mem_map[MAP_NR(tmp)].count = 1;
                free_page(tmp);
        }
        tmp = nr_free_pages << PAGE_SHIFT;
        printk("Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data)\n",
                tmp >> 10,
                high_memory >> 10,
                codepages << (PAGE_SHIFT-10),
                reservedpages << (PAGE_SHIFT-10),
                datapages << (PAGE_SHIFT-10));
        return;
}
#define free_page(addr) free_pages((addr),0)

void free_pages(unsigned long addr, unsigned long order)
{
        unsigned long map_nr = MAP_NR(addr);

        if (map_nr < MAP_NR(high_memory)) {
                mem_map_t * map = mem_map + map_nr;
                if (PageReserved(map))
                        return;
                if (atomic_dec_and_test(&map->count)) {
                        delete_from_swap_cache(map_nr);
                        free_pages_ok(map_nr, order);
                        return;
                }
        }
}

static inline void free_pages_ok(unsigned long map_nr, unsigned long order)
{
        struct free_area_struct *area = free_area + order;
        unsigned long index = map_nr >> (1 + order);
        unsigned long mask = (~0UL) << order;

        cli();
#define list(x) (mem_map+(x))
        map_nr &= mask;
        nr_free_pages -= mask;          /* -mask = 1+~mask */
        while (mask + (1 << (NR_MEM_LISTS-1))) {
                if (!change_bit(index, area->map))
                        break;
                remove_mem_queue(list(map_nr ^ -mask)); /* neighbor */
                mask <<= 1;
                area++;
                index >>= 1;
                map_nr &= mask;
        }
        add_mem_queue(area, list(map_nr));
#undef list
}
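The bit tricks in free_pages_ok are easier to follow when written out separately. In the code above, mask is ~0UL << order, so -mask equals 1 << order and map_nr ^ -mask is the page number of the buddy block. The helpers below are a sketch with invented names (buddy_of, pair_index, merged_start), not kernel functions.

```c
#include <assert.h>

/* Page number of the buddy of the block starting at map_nr. */
static unsigned long buddy_of(unsigned long map_nr, unsigned order)
{
    assert(order < 8 * sizeof(unsigned long) - 1);
    return map_nr ^ (1UL << order);      /* flip the bit that tells the pair apart */
}

/* Both buddies of a pair share one bit in the order's bitmap. */
static unsigned long pair_index(unsigned long map_nr, unsigned order)
{
    assert(order < 8 * sizeof(unsigned long) - 1);
    return map_nr >> (1 + order);
}

/* Start of the merged block of order+1 that contains map_nr. */
static unsigned long merged_start(unsigned long map_nr, unsigned order)
{
    assert(order < 8 * sizeof(unsigned long) - 1);
    return map_nr & ~((1UL << (order + 1)) - 1);
}
```

For example, the 2-page block at page 6 (order 1) has its buddy at page 4; both map to bit 1 of the order-1 bitmap, and when both are free they merge into the 4-page block starting at page 4.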
extern inline unsigned long get_free_page(int priority)
{
        unsigned long page;

        page = __get_free_page(priority);
        if (page)
                memset((void *) page, 0, PAGE_SIZE);
        return page;
}

#define __get_free_page(priority) __get_free_pages((priority),0,0)

unsigned long __get_free_pages(int priority, unsigned long order, int dma)
{
        unsigned long flags;
        int reserved_pages;

        if (order >= NR_MEM_LISTS)
                return 0;
        if (intr_count && priority != GFP_ATOMIC) {
                static int count = 0;
                if (++count < 5) {
                        printk("gfp called nonatomically from interrupt %p\n",
                                __builtin_return_address(0));
                        priority = GFP_ATOMIC;
                }
        }
        reserved_pages = 5;
        if (priority != GFP_NFS)
                reserved_pages = min_free_pages;
        if ((priority == GFP_BUFFER || priority == GFP_IO) && reserved_pages >= 48)
                reserved_pages -= (12 + (reserved_pages>>3));
        save_flags(flags);
repeat:
        cli();
        if ((priority==GFP_ATOMIC) || nr_free_pages > reserved_pages) {
                RMQUEUE(order, dma);
                restore_flags(flags);
                return 0;
        }
        restore_flags(flags);
        if (priority != GFP_BUFFER && try_to_free_page(priority, dma, 1))
                goto repeat;
        return 0;
}

/*
 * Some ugly macros to speed up __get_free_pages()..
 */
#define MARK_USED(index, order, area) \
        change_bit((index) >> (1+(order)), (area)->map)
#define CAN_DMA(x) (PageDMA(x))
#define ADDRESS(x) (PAGE_OFFSET + ((x) << PAGE_SHIFT))

#define RMQUEUE(order, dma) \
do { struct free_area_struct * area = free_area+order; \
     unsigned long new_order = order; \
        do { struct page *prev = memory_head(area), *ret; \
                while (memory_head(area) != (ret = prev->next)) { \
                        if (!dma || CAN_DMA(ret)) { \
                                unsigned long map_nr = ret->map_nr; \
                                (prev->next = ret->next)->prev = prev; \
                                MARK_USED(map_nr, new_order, area); \
                                nr_free_pages -= 1 << order; \
                                EXPAND(ret, map_nr, order, new_order, area); \
                                restore_flags(flags); \
                                return ADDRESS(map_nr); \
                        } \
                        prev = ret; \
                } \
                new_order++; area++; \
        } while (new_order < NR_MEM_LISTS); \
} while (0)

#define EXPAND(map,index,low,high,area) \
do { unsigned long size = 1 << high; \
        while (high > low) { \
                area--; high--; size >>= 1; \
                add_mem_queue(area, map); \
                MARK_USED(index, high, area); \
                index += size; \
                map += size; \
        } \
        map->count = 1; \
        map->age = PAGE_INITIAL_AGE; \
} while (0)
kernel_thread
        call sys_clone();
        if (StackIsChanged() /* new process */) {
                call fn(args);
                sys_exit();
        } else {
                /* do nothing */
                /* task[0] goes through here */
        }
cpu_idle() → sys_idle() → schedule()

static inline pid_t kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
        long retval;

        __asm__ __volatile__(
                "movl %%esp,%%esi\n\t"
                "int $0x80\n\t"         /* Linux/i386 system call */
                "cmpl %%esp,%%esi\n\t"  /* child or parent? */
                "je 1f\n\t"             /* parent - jump */
                "pushl %3\n\t"          /* push argument */
                "call *%4\n\t"          /* call fn */
                "movl %2,%0\n\t"        /* exit */
                "int $0x80\n"
                "1:\t"
                :"=a" (retval)
                :"0" (__NR_clone), "i" (__NR_exit),
                 "r" (arg), "r" (fn),
                 "b" (flags | CLONE_VM)
                :"si");
        return retval;
}
System Calls
/*
 * This file contains the system call numbers. (unistd.h)
 */
#define __NR_setup                0     /* used only by init, to get system going */
#define __NR_exit                 1
#define __NR_fork                 2
#define __NR_read                 3
#define __NR_write                4
#define __NR_open                 5
……..
#define __NR_clone              120
……..
#define __NR_sched_rr_get_interval 161
#define __NR_nanosleep          162
#define __NR_mremap             163

.data                                   /* entry.S */
ENTRY(sys_call_table)
        .long SYMBOL_NAME(sys_setup)            /* 0 */
        .long SYMBOL_NAME(sys_exit)
        .long SYMBOL_NAME(sys_fork)
        .long SYMBOL_NAME(sys_read)
        .long SYMBOL_NAME(sys_write)
        .long SYMBOL_NAME(sys_open)             /* 5 */
        ……..
        .long SYMBOL_NAME(sys_clone)            /* 120 */
        ……..
        .long SYMBOL_NAME(sys_sched_rr_get_interval)
        .long SYMBOL_NAME(sys_nanosleep)
        .long SYMBOL_NAME(sys_mremap)
        .long 0,0
        .long SYMBOL_NAME(sys_vm86)
        .space (NR_syscalls-166)*4              /* 256 */

Pseudo Code for System Call
        if (sys_call_num >= NR_syscalls)
                return -ENOSYS;
        else {
                if (sys_call_table[sys_call_num]==NULL)
                        return -ENOSYS;
                if (PF_TRACESYS) {
                        syscall_trace();
                        call sys_call_table[sys_call_num];
                        syscall_trace();
                } else
                        call sys_call_table[sys_call_num];
        }
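The dispatch logic of the pseudo code can be written as ordinary C with a table of function pointers. This is a user-space sketch only: the table size DEMO_NR_SYSCALLS, the one-argument handler type and the demo_double handler are assumptions made for the example, and tracing is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NR_SYSCALLS 4                 /* hypothetical, tiny table */

typedef long (*syscall_fn)(long);          /* simplified: one argument */

static syscall_fn demo_call_table[DEMO_NR_SYSCALLS]; /* empty slots are NULL */

/* Register a handler, like one .long line of sys_call_table. */
static void install_syscall(unsigned long nr, syscall_fn fn)
{
    assert(nr < DEMO_NR_SYSCALLS);
    demo_call_table[nr] = fn;
}

/* Example handler, invented for this sketch. */
static long demo_double(long x) { return 2 * x; }

/* Bounds check, empty-slot check, then an indirect call. */
static long dispatch(unsigned long nr, long arg)
{
    if (nr >= DEMO_NR_SYSCALLS)
        return -ENOSYS;
    if (demo_call_table[nr] == NULL)
        return -ENOSYS;
    return demo_call_table[nr](arg);
}
```

After install_syscall(1, demo_double), dispatch(1, 21) returns 42, while an out-of-range number or an empty slot returns -ENOSYS, as in the two checks of the pseudo code.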
ENTRY(system_call)
        pushl   %eax            # save orig_eax, for syscall_trace (strace)
        SAVE_ALL

STACK
         0(%esp) - %ebx
         4(%esp) - %ecx
         8(%esp) - %edx
         C(%esp) - %esi
        10(%esp) - %edi
        14(%esp) - %ebp         # SAVE_ALL
        18(%esp) - %eax
        1C(%esp) - %ds
        20(%esp) - %es
        24(%esp) - %fs
        28(%esp) - %gs
        2C(%esp) - orig_eax     # pushl %eax
        30(%esp) - %eip
        34(%esp) - %cs          # push by CPU, int 0x80
        38(%esp) - %eflags
        3C(%esp) - %oldesp      # push by CPU, stack switching
        40(%esp) - %oldss

        movl    $-ENOSYS,EAX(%esp)
        cmpl    $(NR_syscalls),%eax     # EAX=SYS_CALL_NUM
        jae     ret_from_sys_call
        movl    SYMBOL_NAME(sys_call_table)(,%eax,4),%eax
        testl   %eax,%eax
        je      ret_from_sys_call
        ……..
        testb   $0x20,flags(%ebx)       # PF_TRACESYS
        jne     1f
        call    *%eax
        movl    %eax,EAX(%esp)          # save the return value
        jmp     ret_from_sys_call
        ALIGN
1:      call    SYMBOL_NAME(syscall_trace)
        movl    ORIG_EAX(%esp),%eax
        call    SYMBOL_NAME(sys_call_table)(,%eax,4)
        movl    %eax,EAX(%esp)          # save the return value
        call    SYMBOL_NAME(syscall_trace)
sys_clone
asmlinkage int sys_clone(struct pt_regs regs)
{
        unsigned long clone_flags;
        unsigned long newsp;

        clone_flags = regs.ebx;
        newsp = regs.ecx;
        if (!newsp)
                newsp = regs.esp;
        return do_fork(clone_flags, newsp, &regs);
}
do_fork
• Copy process structure from parent
int do_fork(unsigned long clone_flags, unsigned long usp, struct pt_regs *regs)
{
        int nr;
        int error = -ENOMEM;
        unsigned long new_stack;
        struct task_struct *p;

        p = (struct task_struct *) kmalloc(sizeof(*p), GFP_KERNEL);
        if (!p)
                goto bad_fork;
        new_stack = alloc_kernel_stack();       /* get_free_page(GFP_KERNEL) */
        if (!new_stack)
                goto bad_fork_free_p;
        error = -EAGAIN;
        nr = find_empty_process();
        if (nr < 0)
                goto bad_fork_free_stack;

        *p = *current;

        if (p->exec_domain && p->exec_domain->use_count)
                (*p->exec_domain->use_count)++;
        if (p->binfmt && p->binfmt->use_count)
                (*p->binfmt->use_count)++;

        p->did_exec = 0;
        p->swappable = 0;
        p->kernel_stack_page = new_stack;
        *(unsigned long *) p->kernel_stack_page = STACK_MAGIC;
        p->state = TASK_UNINTERRUPTIBLE;
        p->flags &= ~(PF_PTRACED|PF_TRACESYS|PF_SUPERPRIV);
        p->flags |= PF_FORKNOEXEC;
        p->pid = get_pid(clone_flags);
        p->next_run = NULL;
        p->prev_run = NULL;
        p->p_pptr = p->p_opptr = current;
        p->p_cptr = NULL;
        init_waitqueue(&p->wait_chldexit);
        p->signal = 0;
        p->it_real_value = p->it_virt_value = p->it_prof_value = 0;
        p->it_real_incr = p->it_virt_incr = p->it_prof_incr = 0;
        init_timer(&p->real_timer);
        p->real_timer.data = (unsigned long) p;
        p->leader = 0;          /* session leadership doesn't inherit */
        p->tty_old_pgrp = 0;
        p->utime = p->stime = 0;
        p->cutime = p->cstime = 0;
        p->start_time = jiffies;
        task[nr] = p;
        SET_LINKS(p);
        nr_tasks++;

        error = -ENOMEM;
        /* copy all the process information */
        if (copy_files(clone_flags, p))
                goto bad_fork_cleanup;
        if (copy_fs(clone_flags, p))
                goto bad_fork_cleanup_files;
        if (copy_sighand(clone_flags, p))
                goto bad_fork_cleanup_fs;
        if (copy_mm(clone_flags, p))
                goto bad_fork_cleanup_sighand;
        copy_thread(nr, clone_flags, usp, p, regs);
        p->semundo = NULL;

        /* ok, now we should be set up.. */
        p->swappable = 1;
        p->exit_signal = clone_flags & CSIGNAL;
        p->counter = (current->counter >>= 1);
        wake_up_process(p);     /* state=TASK_RUNNING, insert into run_queue */
        ++total_forks;
        return p->pid;
        /* error handler */
}
Process's Virtual Memory
[Figure: task_struct.mm points to an mm_struct (count, pgd, mmap, mmap_avl, mmap_sem). mmap heads a list of vm_area_struct (vm_end, vm_start, vm_flags, vm_inode, vm_ops, vm_next), one per region such as code and data. vm_ops points to the region's operations: nopage, wppage, swapout, ….]
struct mm_struct {
        int count;
        pgd_t * pgd;
        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack, start_mmap;
        unsigned long arg_start, arg_end, env_start, env_end;
        unsigned long rss, total_vm, locked_vm;
        unsigned long def_flags;
        struct vm_area_struct * mmap;
        struct vm_area_struct * mmap_avl;
        struct semaphore mmap_sem;
};

#define INIT_MM { \
                1, \
                swapper_pg_dir, \
                0, 0, 0, 0, \
                0, 0, 0, 0, \
                0, 0, 0, 0, \
                0, 0, 0, \
                0, \
                &init_mmap, &init_mmap, MUTEX }

struct vm_area_struct {
        struct mm_struct * vm_mm;       /* VM area parameters */
        unsigned long vm_start;
        unsigned long vm_end;
        pgprot_t vm_page_prot;
        unsigned short vm_flags;
/* AVL tree of VM areas per task, sorted by address */
        short vm_avl_height;
        struct vm_area_struct * vm_avl_left;
        struct vm_area_struct * vm_avl_right;
/* linked list of VM areas per task, sorted by address */
        struct vm_area_struct * vm_next;
/* more */
        struct vm_operations_struct * vm_ops;
        unsigned long vm_offset;
        struct inode * vm_inode;
        unsigned long vm_pte;           /* shared mem */
};

#define INIT_MMAP { &init_mm, 0, 0x40000000, PAGE_SHARED, VM_READ | VM_WRITE | VM_EXEC }
copy_thread
Copy TSS from parent and set some private fields
void copy_thread(int nr, unsigned long clone_flags, unsigned long esp,
        struct task_struct * p, struct pt_regs * regs)
{
        int i;
        struct pt_regs * childregs;

        p->tss.es = KERNEL_DS;
        p->tss.cs = KERNEL_CS;
        p->tss.ss = KERNEL_DS;
        p->tss.ds = KERNEL_DS;
        p->tss.fs = USER_DS;
        p->tss.gs = KERNEL_DS;
        p->tss.ss0 = KERNEL_DS;
        p->tss.esp0 = p->kernel_stack_page + PAGE_SIZE;
        p->tss.tr = _TSS(nr);
        childregs = ((struct pt_regs *) (p->kernel_stack_page + PAGE_SIZE)) - 1;
        p->tss.esp = (unsigned long) childregs;
        p->tss.eip = (unsigned long) ret_from_sys_call;
        *childregs = *regs;
        childregs->eax = 0;
        childregs->esp = esp;
        p->tss.back_link = 0;
        p->tss.eflags = regs->eflags & 0xffffcfff;      /* iopl is always 0 for a new process */
        p->tss.ldt = _LDT(nr);
        set_tss_desc(gdt+(nr<<1)+FIRST_TSS_ENTRY,&(p->tss));
        p->tss.bitmap = offsetof(struct thread_struct,io_bitmap);
        for (i = 0; i < IO_BITMAP_SIZE+1 ; i++)         /* IO bitmap is actually SIZE+1 */
                p->tss.io_bitmap[i] = ~0;
}
ret_from_sys_call
• All slow interrupts and system calls end here
ret_from_sys_call:
        cmpl    $0,SYMBOL_NAME(intr_count)      /* handle interrupts */
        jne     2f
9:      movl    SYMBOL_NAME(bh_mask),%eax
        andl    SYMBOL_NAME(bh_active),%eax
        jne     handle_bottom_half
1:      sti
        cmpl    $0,SYMBOL_NAME(need_resched)    /* to see if we need reschedule */
        jne     reschedule
        ………….
2:      RESTORE_ALL

#define RESTORE_ALL \
        ………….. \
        popl %ebx; \
        popl %ecx; \
        popl %edx; \
        popl %esi; \
        popl %edi; \
        popl %ebp; \
        popl %eax; \
        pop %ds; \
        pop %es; \
        pop %fs; \
        pop %gs; \
        addl $4,%esp; \
        iret
schedule
• task->counter: dynamic priority
• task->priority: static priority
• timer interrupt (100Hz):
    jiffies++
    if (current->counter <= 0)
        need_resched=1;
• run queue: links all RUNNABLE tasks
asmlinkage void schedule(void)
{
        int c;
        struct task_struct * p;
        struct task_struct * prev, * next;
        unsigned long timeout = 0;

        /* check alarm, wake up any interruptible tasks that have got a signal */
        allow_interrupts();
        if (intr_count)
                goto scheduling_in_interrupt;
        if (bh_active & bh_mask) {
                intr_count = 1;
                do_bottom_half();
                intr_count = 0;
        }
        need_resched = 0;
        prev = current;
        cli();
        /* move an exhausted RR process to be last.. */
        if (!prev->counter && prev->policy == SCHED_RR) {
                prev->counter = prev->priority;
                move_last_runqueue(prev);
        }
        ………….
        p = init_task.next_run;
        sti();
        c = -1000;
        next = idle_task;
        while (p != &init_task) {
                int weight = goodness(p, prev, this_cpu);
                if (weight > c)
                        c = weight, next = p;
                p = p->next_run;
        }
        /* if all runnable processes have "counter == 0", re-calculate counters */
        if (!c) {
                for_each_task(p)
                        p->counter = (p->counter >> 1) + p->priority;
        }
        if (prev != next) {
                kstat.context_swtch++;
                …………..
                switch_to(prev,next);
        }
        return;
}
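The selection loop and the counter recalculation can be isolated into a small model. This sketch makes a simplifying assumption that is not in the slides: goodness() is taken to be just the remaining counter, which ignores real-time policies and any other bonus. The struct and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Only the two scheduling fields of a task (illustrative type). */
struct demo_task {
    int counter;   /* dynamic priority: ticks left in this round */
    int priority;  /* static priority: ticks granted per round */
};

/*
 * Return the index of the task with the largest counter, or -1 if the
 * array is empty (the idle task would run). *best receives the weight.
 */
static int pick_next(const struct demo_task *t, int n, int *best)
{
    int next = -1;
    int c = -1000;

    assert(best != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        if (t[i].counter > c) {
            c = t[i].counter;
            next = i;
        }
    }
    *best = c;
    return next;
}

/* New round: half of the unused ticks carry over, plus the priority. */
static void recalc_counters(struct demo_task *t, int n)
{
    for (int i = 0; i < n; i++)
        t[i].counter = (t[i].counter >> 1) + t[i].priority;
}
```

When every runnable task has counter 0, pick_next reports a best weight of 0, which is the `if (!c)` case above. A task that still had 6 ticks left and has priority 20 then starts the next round with 3 + 20 = 23 ticks, so tasks that did not use up their slice get a bounded bonus.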
#define switch_to(prev,next) do { \
__asm__("movl %2,"SYMBOL_NAME_STR(current_set)"\n\t" \
        "ljmp %0\n\t" \
        …………….. \
        : /* no outputs */ \
        :"m" (*(((char *)&next->tss.tr)-4)), \
         "r" (prev), "r" (next)); \
} while (0)
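Process Switching
[Figure: process #1 issues int 80 and enters system_call; at ret_from_sys_call, need_resched is set, so schedule() is called and switch_to transfers control to process #2, which returns through ret_from_sys_call and iret.]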
Page Fault
When page fault occurs:
• CR2: contains fault address
• the CPU pushes onto the stack: old SS, old ESP, EFLAGS, CS, EIP, error_code
• error_code bits: U/S, W/R, P
• jump to interrupt handling routine for int 0x0E

ENTRY(page_fault)
        pushl   $ SYMBOL_NAME(do_page_fault)
        jmp     error_code

STACK
         0(%esp) - %ebx
         4(%esp) - %ecx
         8(%esp) - %edx
         C(%esp) - %esi
        10(%esp) - %edi
        14(%esp) - %ebp         # pushl …..
        18(%esp) - %eax
        1C(%esp) - %ds
        20(%esp) - %es
        24(%esp) - %fs
        28(%esp) - %gs          # addr. of do_page_fault
        2C(%esp) - orig_eax     # error_code pushed by CPU
        30(%esp) - %eip
        34(%esp) - %cs          # pushed by CPU on the exception
        38(%esp) - %eflags
        3C(%esp) - %oldesp      # pushed by CPU, stack switching
        40(%esp) - %oldss

error_code:
        push    %fs
        push    %es
        push    %ds
        pushl   %eax
        xorl    %eax,%eax
        pushl   %ebp
        pushl   %edi
        pushl   %esi
        pushl   %edx
        decl    %eax                    # eax = -1
        pushl   %ecx
        pushl   %ebx
        cld
        xorl    %ebx,%ebx               # zero ebx
        xchgl   %eax, ORIG_EAX(%esp)    # orig_eax (get the error code. )
        mov     %gs,%bx                 # get the lower order bits of gs
        movl    %esp,%edx
        xchgl   %ebx, GS(%esp)          # get the address and save gs.
        pushl   %eax                    # push the error code (argument)
        pushl   %edx
        movl    $(KERNEL_DS),%edx
        mov     %dx,%ds
        mov     %dx,%es
        movl    $(USER_DS),%edx
        mov     %dx,%fs
        movl    SYMBOL_NAME(current_set),%eax
        call    *%ebx                   # call do_page_fault
        addl    $8,%esp                 # make a similar stack as system call
        jmp     ret_from_sys_call
do_page_fault
• This routine handles page faults. It determines the address, and the problem, and then passes it off to one of the appropriate routines.
• error_code:
bit 0 == 0 means no page found,
1 means protection fault
bit 1 == 0 means read, 1 means write
bit 2 == 0 means kernel, 1 means user-mode
asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
        void (*handler)(struct task_struct *,
                        struct vm_area_struct *,
                        unsigned long,
                        int);
        struct task_struct *tsk = current;
        struct mm_struct *mm = tsk->mm;
        struct vm_area_struct * vma;
        ….
        /* get the address */
        __asm__("movl %%cr2,%0":"=r" (address));
        vma = find_vma(mm, address);
        if (!vma)
                goto bad_area;
        if (vma->vm_start <= address)
                goto good_area;
        …...
/*
 * Something tried to access memory that isn't in our memory map..
 * Fix it, but check if it's kernel or user first..
 */
bad_area:
        if (error_code & 4) {   /* user mode, kill it */
                tsk->tss.cr2 = address;
                tsk->tss.error_code = error_code;
                tsk->tss.trap_no = 14;
                force_sig(SIGSEGV, tsk);
                return;
        }
        …...
}

good_area:
        handler = do_no_page;
        switch (error_code & 3) {
                default:        /* 3: write, present */
                        handler = do_wp_page;
                        /* fall through */
                case 2:         /* write, not present */
                        if (!(vma->vm_flags & VM_WRITE))
                                goto bad_area;
                        break;
                case 1:         /* read, present */
                        goto bad_area;
                case 0:         /* read, not present */
                        if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
                                goto bad_area;
        }
        handler(tsk, vma, address, write);
        .…..
        return;

          not present                          present
  write   check if you can write, do_no_page   check if you can write, do_wp_page
  read    check if you can read, do_no_page    bad_area
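The decision table can be expressed as a pure function of the two low error-code bits and the region's permission flags. This is a sketch rather than the kernel's code: the enum, the DEMO_VM_* flag values and classify_fault are names chosen for this example.

```c
#include <assert.h>

/* Possible outcomes for a fault inside a valid vm_area (names are ours). */
enum fault_action { FAULT_BAD_AREA, FAULT_NO_PAGE, FAULT_WP_PAGE };

/* Region permission bits, defined locally for the example. */
#define DEMO_VM_READ   0x1u
#define DEMO_VM_WRITE  0x2u
#define DEMO_VM_EXEC   0x4u

/* error_code bit 0: 1 = page present; bit 1: 1 = write access. */
static enum fault_action classify_fault(unsigned error_code, unsigned vm_flags)
{
    assert(vm_flags <= (DEMO_VM_READ | DEMO_VM_WRITE | DEMO_VM_EXEC));
    switch (error_code & 3) {
    case 3:  /* write, present: copy-on-write candidate */
        return (vm_flags & DEMO_VM_WRITE) ? FAULT_WP_PAGE : FAULT_BAD_AREA;
    case 2:  /* write, not present */
        return (vm_flags & DEMO_VM_WRITE) ? FAULT_NO_PAGE : FAULT_BAD_AREA;
    case 1:  /* read, present: a read of a present page should not fault */
        return FAULT_BAD_AREA;
    default: /* 0: read, not present */
        return (vm_flags & (DEMO_VM_READ | DEMO_VM_EXEC)) ? FAULT_NO_PAGE
                                                          : FAULT_BAD_AREA;
    }
}
```

A write to a present page of a writable region selects do_wp_page (FAULT_WP_PAGE), and the same write to a read-only region ends in bad_area. Bit 2 of the error code (user or kernel mode) does not affect the choice of handler; it only matters once bad_area is reached.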
do_no_page
1. Address is present in memory, just return
2. Address in swap area, call do_swap_page to swap it in
   [Figure: cr3 of tsk leads through the page tables to a page that is brought in from disk.]
3. If no nopage routine is defined in the vm_area_struct, get a free page and link. (uninitialized data)
4. If a nopage routine is defined in the vm_area_struct, call it (file_mmap_nopage, tries to share pages with other tasks)
   [Figure: cr3 of tsk leads through the page tables to a page obtained with get_free_page.]
do_wp_page
1. Address not present, return
2. Page is PAGE_RW, return
3. If the page is referenced by only one task (count==1), make it PAGE_RW.
4. If the page is referenced by more than one task, copy a new page and make it PAGE_RW.
   [Figure: tsk1 and tsk (each with its own cr3) share one page; the page is copied to a new page for tsk, and the new page is set PAGE_RW.]
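The four do_wp_page cases amount to copy-on-write with a reference count. The following user-space model is a sketch only: demo_page, demo_pte and wp_fault are invented for illustration, malloc stands in for get_free_page, and the "address not present" case is modelled as a NULL page pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096

/* A physical page with a reference count (illustrative type). */
struct demo_page {
    int count;                           /* number of tasks mapping it */
    unsigned char data[DEMO_PAGE_SIZE];
};

/* One task's mapping of a page (illustrative type). */
struct demo_pte {
    struct demo_page *page;              /* NULL models "not present" */
    int writable;                        /* models PAGE_RW */
};

/* Handle a write fault on a mapping. Returns 0 on success, -1 if out of memory. */
static int wp_fault(struct demo_pte *pte)
{
    struct demo_page *copy;

    assert(pte != NULL);
    if (pte->page == NULL)               /* 1. not present: nothing to do */
        return 0;
    if (pte->writable)                   /* 2. already writable */
        return 0;
    if (pte->page->count == 1) {         /* 3. sole user: just allow writes */
        pte->writable = 1;
        return 0;
    }
    copy = malloc(sizeof *copy);         /* 4. shared: give this task a copy */
    if (copy == NULL)
        return -1;
    memcpy(copy->data, pte->page->data, DEMO_PAGE_SIZE);
    copy->count = 1;
    pte->page->count--;                  /* the old page loses one user */
    pte->page = copy;
    pte->writable = 1;
    return 0;
}
```

If two tasks share a page with count 2, the first write fault copies the page and drops the original's count to 1; the second task's write fault then finds count==1 and only flips its mapping to writable, so no second copy is made.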