
Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang


Page 1: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Operating System Design - Linux

Instructor: Ching-Chi Hsu

TA:Yung-Yu Chuang

Page 2: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Introduction to Linux (Nov. 1991, Linus Torvalds)

• Multi-tasking

• Demand loading & Copy On Write

• Paging (not swapping)

• Shared Libraries

• POSIX 1003.1

• Protected Mode

• Support different file systems and executable formats

Page 3: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Multitasking

[Figure: process timelines alternating between "require service" and "CPU idle"; for time-sharing, a timer interrupt preempts the running process when its time expires.]

Page 4: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

• Based on i386 and Linux 2.0.33

• Topics
– initialization
– memory management (free space management, virtual memory management)
– process management (context switching, scheduling)
– system call

Page 5: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Resources for Tracing Linux

• http://odie.csie.ntu.edu.tw/~osd

• TLK, KHG, Linux Kernel Internals

• Source code browser

• Intel Programmer’s manual

Page 6: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Source Tree for Linux

/usr/src/linux
    arch
        i386
            kernel
            boot
            mm
        ????
    drivers
        char
        block
        scsi
        net
    fs
        nfs
        ext2
        proc
        ...
    include
        linux
        asm-i386
        asm-????
    init
    ipc
    kernel
    lib
    modules
    net
    ...

Page 7: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

How to compile Linux Kernel

1. make config (or make menuconfig)
2. make depend
3. One of:
   make boot (generate a compressed bootable Linux kernel, arch/i386/boot/zImage)
   make zdisk (generate the kernel and write it to a floppy disk: dd if=zImage of=/dev/fd0)
   make zlilo (generate the kernel and copy it to /vmlinuz)

lilo: Linux Loader

Page 8: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

i386

• Segmented Addressing (segment:offset)

• Paging(Virtual Memory)

• Call Gate (Protection)

• TSS (Context Switching)

Page 9: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

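[Figure: segment translation. The TI bit of the SELECTOR chooses the GDT (located by GDTR) or the LDT (located by LDTR); INDEX picks a descriptor in that table; the descriptor's base plus OFFSET gives the linear address.]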

Page 10: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

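[Figure: descriptor table and descriptor format. The table register holds BASE and LIMIT; the table runs from BASE to BASE+LIMIT with one 8-byte entry every 8 bytes (BASE, BASE+8, ...). Entries are segment descriptors, call gates, or TSS descriptors.]

Bits 31..0:   BASE 15:0 | LIMIT 15:0
Bits 63..32:  BASE 31:24 | G | D | 0 | AVL | LIMIT 19:16 | P | DPL | S | TYPE | BASE 23:16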

Page 11: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

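[Figure: two-level paging. CR3 points to the page directory. The linear address is split into a directory index (ddd), a table index (ttt), and an offset (ooo). The PDE (yyyyy000) locates a page table, the PTE (zzzzz000) locates a 4K page, and the page address plus the offset gives the physical address (zzzzzooo). Each PDE/PTE holds a page address and a present bit (P).]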
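The address split in the paging figure can be sketched in C. This is only an illustration: the names `linear_parts` and `split_linear` are made up for this example, and the 10/10/12-bit split assumes the i386 4K-page format shown on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields of an i386 linear address. */
struct linear_parts {
    uint32_t dir;     /* index into the page directory (bits 31..22) */
    uint32_t table;   /* index into the page table (bits 21..12) */
    uint32_t offset;  /* byte offset inside the 4K page (bits 11..0) */
};

/* Split a 32-bit linear address the way the paging unit does. */
static struct linear_parts split_linear(uint32_t linear)
{
    struct linear_parts p;

    p.dir = linear >> 22;
    p.table = (linear >> 12) & 0x3FFu;
    p.offset = linear & 0xFFFu;
    assert(p.dir < 1024u);   /* a page directory has 1024 entries */
    return p;
}
```

For instance, `split_linear(0xC0000000u).dir` is 768, which is why page directory entry 768 shows up later when the kernel is mapped at 0xC0000000.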

Page 12: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

[Figure: virtual memory. The 4GB linear address space, with a region used by the OS, is mapped onto physical memory and backed by disk.]

Page 13: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

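Call Gate

[Figure: privilege levels 3, 2, 1, 0; a call gate is the protected entry point for transferring control to a more privileged level.]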

Page 14: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

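Calling a TSS gate causes a context switch

[Figure: a TSS gate refers to a TSS descriptor in the GDT. The TSS stores the task state that the CPU saves and reloads: CS, DS, ES, ..., IP, SP0, SP1, SP2, SP3, CR3, ...]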

Page 15: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

i386 Initialization

• #RESET
– real-address mode
– self-test
– EAX contains error code
– EDX contains CPU id
– CR0

[Figure: CR0 layout with the PG, TS, EM, MP, and PE flags; the remaining bits are reserved (0).]

Page 16: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Register State

Register        Value
EFLAGS          0XXXX0002H
EIP             0000FFF0H
CS*             0F000H
DS**            0000H
SS              0000H
ES**            0000H
FS              0000H
GS              0000H
IDTR (base)     00000000H
IDTR (limit)    03FFH
DR7             0000H

*  invisible part: 0FFFF0000 (base), 0FFFF (limit)
** invisible part: 0 (base), 0FFFF (limit)

Page 17: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

FFFF0H: ROM-BIOS address
* do some tests
* initialize the interrupt vectors at physical address 0
* load the first sector of a bootable device to 0x7C00 (boot/bootsect.S)
* jump to 0x7C00 and run

Page 18: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Linux Kernel on Disk

[Figure: layout of /usr/src/linux/arch/i386/boot/zImage: bootsect.S (1 sector), setup.S (4 sectors), then the self-extracting kernel image, which consists of the decompression module and the compressed kernel image (vmlinux.out, 455,321 bytes) produced from the executable vmlinux (1,133,665 bytes).]

Page 19: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

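[Figure: memory map while booting from the boot disk. The CPU's IP points at 0x7C00, where the boot sector is loaded; 0x90000 is marked as the area used later by the boot code; 0xA0000 up to 1M is I/O & BIOS; the A20 line controls access above 1M. A 64K size label also appears in the figure.]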

Page 20: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

[Figure: the BIOS loads bootsect.S (0.5K bytes) to 0x7C00 and IP starts there; bootsect.S then copies itself (0.5K bytes) to 0x90000 and IP continues in the copy.]

Page 21: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

[Figure: two steps. First, bootsect.S (0.5K bytes at 0x90000, original copy still at 0x7C00) loads setup.S (2K bytes) to 0x90200. Second, it loads vmlinux (508K bytes) to 0x10000.]

Page 22: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Copy bootsect.S to 0x90000

SETUPSECS = 4               ! nr of setup-sectors
BOOTSEG   = 0x07C0          ! original address of boot-sector
INITSEG   = DEF_INITSEG     ! we move boot here - out of the way 0x9000
SETUPSEG  = DEF_SETUPSEG    ! setup starts here, 0x9020
SYSSEG    = DEF_SYSSEG      ! system loaded at 0x10000 (65536)

< omitted>

        mov     ax,#BOOTSEG
        mov     ds,ax
        mov     ax,#INITSEG
        mov     es,ax
        mov     cx,#256
        sub     si,si
        sub     di,di
        cld
        rep
        movsw

        jmpi    go,INITSEG      ! Execute moved bootsect
go:
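The segment values above (0x07C0, 0x9000, 0x9020, 0x1000) map to the physical addresses quoted on the slides through the real-mode rule physical = segment * 16 + offset. A minimal sketch in C, where `real_mode_phys` is a hypothetical helper name and the 1 MB wrap-around of older CPUs is deliberately ignored:

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode address translation: physical = segment * 16 + offset. */
static uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    uint32_t phys = ((uint32_t)segment << 4) + offset;

    /* 0xFFFF:0xFFFF is the highest address this formula can produce. */
    assert(phys <= 0x10FFEFu);
    return phys;
}
```

With this rule, BOOTSEG 0x07C0 gives 0x7C00, INITSEG 0x9000 gives 0x90000, and SETUPSEG 0x9020 gives 0x90200. The `mov cx,#256` with `rep movsw` moves 256 words, i.e. the 512-byte boot sector.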

Page 23: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Try to load setup.S from (drive 0, head 0, sector 2, track 0) to memory 0x90200

<omit>
load_setup:
        xor     dx, dx                  ! drive 0, head 0
        mov     cl,#0x02                ! sector 2, track 0
        mov     bx,#0x0200              ! address = 512, in INITSEG
        mov     ah,#0x02                ! service 2, nr of sectors
        mov     al,setup_sects          ! (assume all on head 0, track 0)
                                        ! setup_sects = 4
        int     0x13                    ! read it (BIOS routine)
        jnc     ok_load_setup           ! ok - continue

        push    ax                      ! dump error code
        call    print_nl
        mov     bp, sp
        call    print_hex
        pop     ax

        jmp     load_setup
ok_load_setup:

Page 24: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Print "\nLoading"

<omit>
! Print some inane message
        mov     ah,#0x03        ! read cursor pos
        xor     bh,bh
        int     0x10
        mov     cx,#9
        mov     bx,#0x0007      ! page 0, attribute 7 (normal)
        mov     bp,#msg1        ! .byte 13,10  .ascii "Loading"
        mov     ax,#0x1301      ! write string, move cursor
        int     0x10            ! BIOS routine

! ok, we've written the message, now
! we want to load the system (at 0x10000)
        mov     ax,#SYSSEG
        mov     es,ax           ! segment of 0x010000
        call    read_it         ! Read 508K to 0x10000 (64K), one . per track
        call    kill_motor      ! Stop floppy motor
        call    print_nl
<omit>
        jmpi    0,SETUPSEG      ! Jump to 0x90200 (setup.S)

Page 25: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

setup.S

• Check memory size

• set keyboard, video adapter, get HD data

• switch to protected mode
– set GDT
– set IDT
– set PE bit (flush pipe)

Page 26: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

start:
        jmp     start_of_setup
! ------------------------ start of header --------------------------------
!
! SETUP-header, must start at CS:2 (old 0x9020:2)
!
        .ascii  "HdrS"          ! Signature for SETUP-header
        .word   0x0201          ! Version number of header format
                                ! (must be >= 0x0105
                                ! else old loadlin-1.5 will fail)

<omit>
start_of_setup:

        …………… (check signature)

good_sig:
        mov     ax,cs                   ! aka #SETUPSEG
        sub     ax,#DELTA_INITSEG       ! aka #INITSEG
        mov     ds,ax                   ! DS=9000

Page 27: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

loader_ok:
! Get memory size (extended mem, kB)

        mov     ah,#0x88
        int     0x15
        mov     [2],ax          ! Store memory size in 0x90002 (bootsect.S)

<omit>
(disable interrupts)
(move kernel image to 0x1000)

end_move_self:
        lidt    idt_48          ! load idt with 0,0
        lgdt    gdt_48          ! load gdt with whatever appropriate

idt_48:
        .word   0
        .word   0, 0

gdt_48:
        .word   0x800
        .word   512+gdt, 0x9

Page 28: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

            BASE            Limit
idt_48      0,0             0
gdt_48      0x9, 512+gdt    0x800 (2048)

gdt:
        .word   0,0,0,0         ! dummy

        .word   0,0,0,0         ! unused

        .word   0xFFFF          ! 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0x0000          ! base address=0
        .word   0x9A00          ! code read/exec
        .word   0x00CF          ! granularity=4096, 386 (+5th nibble of limit)

        .word   0xFFFF          ! 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0x0000          ! base address=0
        .word   0x9200          ! data read/write
        .word   0x00CF          ! granularity=4096, 386 (+5th nibble of limit)

Page 29: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Bits 31..0:   BASE 15:0 | LIMIT 15:0
Bits 63..32:  BASE 31:24 | G | D | 0 | AVL | LIMIT 19:16 | P | DPL | S | TYPE | BASE 23:16

GDT entries set up by setup.S:
0  null
1  not used
2  code: BASE=0x00000000, LIMIT=0xFFFFF, G=1 (4G), DPL=0, type=1010 (code, non-conforming, r/x, not accessed)
3  data: BASE=0x00000000, LIMIT=0xFFFFF, G=1 (4G), DPL=0, type=0010 (data, expand-up, r/w, not accessed)

Page 30: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

! that was painless, now we enable A20 (no address wrap-around at 1M)

        call    empty_8042
        mov     al,#0xD1        ! command write
        out     #0x64,al
        call    empty_8042
        mov     al,#0xDF        ! A20 on
        out     #0x60,al
        call    empty_8042

<omit>

        mov     ax,#1           ! protected mode (PE) bit
        lmsw    ax              ! This is it! Load into CR0
        jmp     flush_instr     ! Flush pipe
flush_instr:
        xor     bx,bx           ! Flag to indicate a boot

Page 31: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

! NOTE: For high loaded big kernels we need a
!       jmpi    0x100000,KERNEL_CS
!
! but we yet haven't reloaded the CS register, so the default size
! of the target offset still is 16 bit.
! However, using an operant prefix (0x66), the CPU will properly
! take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
! Manual, Mixing 16-bit and 32-bit code, page 16-6)
        db      0x66,0xea       ! prefix + jmpi-opcode
code32: dd      0x1000          ! will be set to 0x100000 for big kernels
        dw      KERNEL_CS       ! KERNEL_CS=0x10

[Figure: selector format. Bits 15..3: INDEX; bit 2: TI (0: GDT, 1: LDT); bits 1..0: RPL. KERNEL_CS = 0x10 = 0000 0000 0001 0000 in binary, so INDEX=2, TI=0, RPL=0.]
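To make the selector layout concrete, the following C sketch unpacks a selector into its three fields. The type and function names are invented for this illustration; the bit positions are those shown in the selector figure.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the fields of a segment selector. */
struct selector_parts {
    unsigned index;  /* descriptor index within the table (bits 15..3) */
    unsigned ti;     /* table indicator: 0 = GDT, 1 = LDT (bit 2) */
    unsigned rpl;    /* requested privilege level (bits 1..0) */
};

static struct selector_parts decode_selector(uint16_t sel)
{
    struct selector_parts p;

    p.index = sel >> 3;
    p.ti = (sel >> 2) & 1u;
    p.rpl = sel & 3u;
    assert(p.index < 8192u);   /* 13 index bits */
    return p;
}
```

Decoding 0x10 yields index 2 in the GDT at privilege level 0, which matches the position of the code descriptor in the gdt table of setup.S (null, unused, code, data).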

Page 32: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Decompress Kernel

startup_32:                             (gcc entry point)
        cld
        cli
        movl $(KERNEL_DS),%eax          # KERNEL_DS=0x18
        mov %ax,%ds
        mov %ax,%es
        mov %ax,%fs
        mov %ax,%gs

<omit>

        lss SYMBOL_NAME(stack_start),%esp
        xorl %eax,%eax
1:      incl %eax               # check that A20 really IS enabled
        movl %eax,0x000000      # loop forever if it isn't
        cmpl %eax,0x100000
        je 1b

Page 33: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

( clear BSS )

/*
 * Do the decompression, and jump to the new kernel..
 */
        subl $16,%esp           # place for structure on the stack
        pushl %esp              # address of structure as first arg
        call SYMBOL_NAME(decompress_kernel)     # decompress kernel to 100000
        orl  %eax,%eax          # gunzip 1.0.3
        jnz  3f
        xorl %ebx,%ebx
        ljmp $(KERNEL_CS), $0x100000    # jump to decompressed kernel

Page 34: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

head.S (copy parameters from 0x90000)

Address     Contents
0x100000    head.S startup code (EIP starts here)
0x101000    swapper_pg_dir
0x102000    pg0
0x103000    empty_bad_page
0x104000    empty_bad_page_table
0x105000    empty_zero_page
0x106000    stack, idt, gdt

Page 35: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

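Setup Paging Table & Enable Paging

[Figure: kernel memory layout from 0x100000 to 0x106000 (PG_DIR, PG0, empty_bad_page, empty_bad_page_table, empty_zero_page, stack, idt, gdt). CR3 points to PG_DIR; page directory entries 0 and 768 both point to PG0, which maps the first 4M of physical memory.]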

Page 36: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

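Setup GDT

[Figure: kernel memory layout from 0x100000 to 0x106000 (PG_DIR, PG0, empty_bad_page, empty_bad_page_table, empty_zero_page, stack, idt, gdt). GDTR points to the gdt.]

Selector    Descriptor
0x00        NULL
0x08        0 (not used)
0x10        base C0000000, 1G, DPL=0, code
0x18        base C0000000, 1G, DPL=0, data
0x23        base 00000000, 3G, DPL=3, code
0x2b        base 00000000, 3G, DPL=3, data
...         0 (not used), followed by 2*NR_TASKS entries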

Page 37: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Setup IDT

[Figure: kernel memory layout from 0x100000 to 0x106000 (PG_DIR, PG0, empty_bad_page, empty_bad_page_table, empty_zero_page, stack, idt, gdt). IDTR points to the idt; all entries 0 to 255 point to ignore_int.]

Page 38: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

        call setup_paging

setup_paging:
        movl $1024*2,%ecx               /* 2 pages - swapper_pg_dir+1 page table */
        xorl %eax,%eax
        movl $ SYMBOL_NAME(swapper_pg_dir),%edi /* swapper_pg_dir is at 0x1000 */
        cld;rep;stosl
/* Identity-map the kernel in low 4MB memory for ease of transition */
/* set present bit/user r/w */
        movl $ SYMBOL_NAME(pg0)+7,SYMBOL_NAME(swapper_pg_dir)
/* But the real place is at 0xC0000000 */
/* set present bit/user r/w */
        movl $ SYMBOL_NAME(pg0)+7,SYMBOL_NAME(swapper_pg_dir)+3072
        movl $ SYMBOL_NAME(pg0)+4092,%edi
        movl $0x03ff007,%eax            /* 4Mb - 4096 + 7 (r/w user,p) */
        std
1:      stosl                   /* fill the page backwards - more efficient :-) */
        subl $0x1000,%eax
        jge 1b
        cld

Page 39: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

        movl $ SYMBOL_NAME(swapper_pg_dir),%eax
        movl %eax,%cr3                  /* cr3 - page directory start */
        movl %cr0,%eax
        orl $0x80000000,%eax
        movl %eax,%cr0                  /* set paging (PG) bit */
        ret                     /* this also flushes the prefetch-queue */

Format of PDE & PTE

Bits 31..12     Page Address
Bit 6           D
Bit 5           A
Bit 2           U/S
Bit 1           R/W
Bit 0           P
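The effect of setup_paging can be modelled in plain C. The sketch below is a user-space simulation, not kernel code: `boot_tables`, `build_boot_tables`, `translate`, and the assumed physical address 0x102000 for pg0 are illustrative choices. It fills pg0 with identity entries (page address | 7), points directory entries 0 and 768 at pg0, and then walks the tables the way the MMU would.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES     1024u
#define PG0_PHYS    0x102000u    /* assumed physical address of pg0 */
#define NOT_MAPPED  UINT32_MAX   /* sentinel for "no translation" */

struct boot_tables {
    uint32_t pg_dir[ENTRIES];    /* models swapper_pg_dir */
    uint32_t pg0[ENTRIES];       /* models the single boot page table */
};

/* Mirror of setup_paging: clear the directory, map the low 4MB twice. */
static void build_boot_tables(struct boot_tables *t)
{
    for (uint32_t i = 0; i < ENTRIES; i++) {
        t->pg_dir[i] = 0;
        t->pg0[i] = (i << 12) | 7u;      /* present, r/w, user */
    }
    t->pg_dir[0] = PG0_PHYS | 7u;        /* identity map at 0x00000000 */
    t->pg_dir[768] = PG0_PHYS | 7u;      /* same pages at 0xC0000000 */
}

/* Two-level table walk; only pg0 exists in this model. */
static uint32_t translate(const struct boot_tables *t, uint32_t linear)
{
    uint32_t pde = t->pg_dir[linear >> 22];
    uint32_t pte;

    if (!(pde & 1u))
        return NOT_MAPPED;
    assert((pde & ~0xFFFu) == PG0_PHYS);
    pte = t->pg0[(linear >> 12) & 0x3FFu];
    if (!(pte & 1u))
        return NOT_MAPPED;
    return (pte & ~0xFFFu) | (linear & 0xFFFu);
}
```

In this model both 0x00001234 and 0xC0001234 translate to physical 0x1234, while 0x00400000 (just past the first 4MB) is unmapped. The offset 3072 in the assembly is 768 entries of 4 bytes each.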

Page 40: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

        lgdt gdt_descr

gdt_descr:
        .word (8+2*NR_TASKS)*8-1
        .long 0xc0000000+SYMBOL_NAME(gdt)

ENTRY(gdt)
        .quad 0x0000000000000000        /* NULL descriptor */
        .quad 0x0000000000000000        /* not used */
        .quad 0xc0c39a000000ffff        /* 0x10 kernel 1GB code at 0xC0000000 */
        .quad 0xc0c392000000ffff        /* 0x18 kernel 1GB data at 0xC0000000 */
        .quad 0x00cbfa000000ffff        /* 0x23 user   3GB code at 0x00000000 */
        .quad 0x00cbf2000000ffff        /* 0x2b user   3GB data at 0x00000000 */
        .quad 0x0000000000000000        /* not used */
        .quad 0x0000000000000000        /* not used */
        .fill 2*NR_TASKS,8,0            /* space for LDT's and TSS's etc */
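The .quad constants can be checked by unpacking them according to the descriptor format. The following C sketch does that; `segment_info` and `decode_descriptor` are names made up for the example, and only the fields discussed on the slides are extracted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: selected fields of a segment descriptor. */
struct segment_info {
    uint32_t base;        /* BASE 31:0 */
    uint32_t limit;       /* LIMIT 19:0, in units chosen by G */
    unsigned granular;    /* G bit: 1 = limit counts 4K pages */
    unsigned dpl;         /* descriptor privilege level */
    unsigned type;        /* 4-bit TYPE field */
    uint64_t size_bytes;  /* (limit + 1) scaled by the granularity */
};

static struct segment_info decode_descriptor(uint64_t d)
{
    struct segment_info s;

    s.base = (uint32_t)((d >> 16) & 0xFFFFFFu)
           | ((uint32_t)((d >> 56) & 0xFFu) << 24);
    s.limit = (uint32_t)(d & 0xFFFFu)
            | ((uint32_t)((d >> 48) & 0xFu) << 16);
    s.granular = (unsigned)((d >> 55) & 1u);
    s.dpl = (unsigned)((d >> 45) & 3u);
    s.type = (unsigned)((d >> 40) & 0xFu);
    s.size_bytes = s.granular ? ((uint64_t)s.limit + 1) << 12
                              : (uint64_t)s.limit + 1;
    assert(s.limit <= 0xFFFFFu);   /* the limit field is 20 bits wide */
    return s;
}
```

Applied to 0xc0c39a000000ffff this gives base 0xC0000000, size 1GB, DPL 0; applied to 0x00cbf2000000ffff it gives base 0, size 3GB, DPL 3, in agreement with the comments in the table.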

Page 41: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

(setup data segments and clear BSS)
        call setup_idt

setup_idt:
        lea ignore_int,%edx
        movl $(KERNEL_CS << 16),%eax
        movw %dx,%ax            /* selector = 0x0010 = cs */
        movw $0x8E00,%dx        /* interrupt gate - dpl=0, present */

        lea SYMBOL_NAME(idt),%edi
        mov $256,%ecx
rp_sidt:
        movl %eax,(%edi)
        movl %edx,4(%edi)
        addl $8,%edi
        dec %ecx
        jne rp_sidt
        ret

Page 42: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

[Figure: interrupt gate written by setup_idt. Low dword: SELECTOR | OFFSET 15:0; high dword: OFFSET 31:16 | 8E00.]

ignore_int: just prints "Unknown Interrupt"

Page 43: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

        lidt idt_descr
        ljmp $(KERNEL_CS),$1f
1:      movl $(KERNEL_DS),%eax  # reload all the segment registers
        mov %ax,%ds             # after changing gdt.
        mov %ax,%es
        mov %ax,%fs
        mov %ax,%gs

        call SYMBOL_NAME(start_kernel)  # jump to C main routine

Page 44: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

start_kernel

asmlinkage void start_kernel(void)
{
    setup_arch(&command_line, &memory_start, &memory_end);
    memory_start = paging_init(memory_start,memory_end);
    trap_init();
    init_IRQ();

    <-------------- omit ---------------->

    memory_start = console_init(memory_start,memory_end);

    memory_start = kmalloc_init(memory_start,memory_end);
    sti();      /* enable interrupts */

Page 45: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

    memory_start = inode_init(memory_start,memory_end);
    memory_start = file_table_init(memory_start,memory_end);
    memory_start = name_cache_init(memory_start,memory_end);

    mem_init(memory_start,memory_end);

    <---------- omit ------------->

    printk(linux_banner);

    sysctl_init();
    kernel_thread(init, NULL, 0);
    cpu_idle(NULL);
}

Page 46: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

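setup_arch

[Figure: physical memory with the kernel loaded at 1M; memory_start marks the end of the kernel image and memory_end the end of physical memory.]

memory_start = (unsigned long) &_end;

memory_end = (1<<20) + (EXT_MEM_K<<10);
memory_end &= PAGE_MASK;

#define PARAM empty_zero_page
#define EXT_MEM_K (*(unsigned short *) (PARAM+2))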
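As a worked example of the memory_end computation, the sketch below repeats the arithmetic in standalone C. The function name and the local mask constant are assumptions for this example; the input stands for the extended-memory size in KB that setup.S stored at offset 2 of the parameter page.

```c
#include <assert.h>

#define PAGE_MASK_4K (~0xFFFUL)   /* assumed 4K pages, as on i386 */

/* memory_end = 1MB of conventional memory + extended memory reported
 * by the BIOS in KB, rounded down to a page boundary. */
static unsigned long memory_end_from_bios(unsigned short ext_mem_k)
{
    unsigned long end = (1UL << 20) + ((unsigned long)ext_mem_k << 10);

    end &= PAGE_MASK_4K;
    assert((end & 0xFFFUL) == 0);   /* result is page aligned */
    return end;
}
```

A machine with 16MB of RAM reports 15360 KB of extended memory, so the result is 0x1000000. Because EXT_MEM_K is a 16-bit value, this path cannot describe more than about 64MB.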

Page 47: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

    init_task.mm->start_code = TASK_SIZE;       /* 0xC0000000 */
    init_task.mm->end_code = TASK_SIZE + (unsigned long) &_etext;
    init_task.mm->end_data = TASK_SIZE + (unsigned long) &_edata;
    init_task.mm->brk = TASK_SIZE + (unsigned long) &_end;

    /* "mem=XXX[kKmM]" overrides the BIOS-reported memory size */

    if (c == ' ' && *(const unsigned long *)from == *(const unsigned long *)"mem=") {
        memory_end = simple_strtoul(from+4, &from, 0);
        if ( *from == 'K' || *from == 'k' ) {
            memory_end = memory_end << 10;
            from++;
        } else if ( *from == 'M' || *from == 'm' ) {
            memory_end = memory_end << 20;
            from++;
        }
    }
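The "mem=" handling can be tried outside the kernel with the standard C library. This is a rough user-space analogue under stated assumptions: `parse_mem_option` is a hypothetical name, `strtoul` stands in for the kernel's `simple_strtoul`, and the option is matched with `strncmp` rather than the word-sized comparison used above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the size given by a "mem=XXX[kKmM]" option, or `fallback`
 * when `opt` is some other option. */
static unsigned long parse_mem_option(const char *opt, unsigned long fallback)
{
    char *end;
    unsigned long value;

    assert(opt != NULL);
    if (strncmp(opt, "mem=", 4) != 0)
        return fallback;

    value = strtoul(opt + 4, &end, 0);   /* base 0: accepts 0x.. too */
    if (*end == 'K' || *end == 'k')
        value <<= 10;
    else if (*end == 'M' || *end == 'm')
        value <<= 20;
    return value;
}
```

For example, "mem=16M" yields 16 << 20, "mem=8192k" yields 8192 << 10, and an unrelated option such as "root=/dev/hda1" leaves the fallback untouched.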

Page 48: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

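paging_init

[Figure: the kernel at 1M contains pg_dir and pg0; further page tables pg1, pg2, ..., pgn are allocated starting at memory_start. Page directory entries 0, 1, ... and 768, 769, ... point to pg0, pg1, pg2, ..., pgn; each page table maps 4M.]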

Page 49: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

    start_mem = PAGE_ALIGN(start_mem);
    address = 0;
    pg_dir = swapper_pg_dir;
    while (address < end_mem) {

        /* map the memory at virtual addr 0xC0000000 */
        pg_table = (pte_t *) (PAGE_MASK & pgd_val(pg_dir[768]));
        if (!pg_table) {
            pg_table = (pte_t *) start_mem;
            start_mem += PAGE_SIZE;
        }

        /* also map it temporarily at 0x0000000 for init */
        pgd_val(pg_dir[0]) = _PAGE_TABLE | (unsigned long) pg_table;
        pgd_val(pg_dir[768]) = _PAGE_TABLE | (unsigned long) pg_table;
        pg_dir++;

Page 50: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

        for (tmp = 0 ; tmp < PTRS_PER_PTE ; tmp++,pg_table++) {
            if (address < end_mem)
                set_pte(pg_table, mk_pte(address, PAGE_SHARED));
            else
                pte_clear(pg_table);
            address += PAGE_SIZE;
        }
    }
    local_flush_tlb();      /* move cr3, r?; mov r?, cr3; */
    return free_area_init(start_mem, end_mem);

Page 51: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

free_area_init

1. Set min_free_pages
2. Initialize swap cache
3. Mark all pages reserved
4. Initialize Buddy system for free memory management

Page 52: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

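Free Memory Management (Tanenbaum)

• Bitmap

• Linked list (first-fit, next-fit, best-fit, quick-fit)

[Figure: 16 allocation units, with boundaries numbered 0 to 16.]

Bitmap (1 = free unit):
0011000011100100

Linked list of (P = process or H = hole, start, length):
P 0 2 -> H 2 2 -> P 4 4 -> H 8 3 -> P 11 2 -> H 13 1 -> P 14 2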
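The two representations carry the same information, which a short C sketch can show by turning the bitmap string into the list of runs. The names `run` and `bitmap_to_runs` are hypothetical, and the sketch assumes, as in the figure, that '1' marks a hole. Note that a bitmap cannot tell two adjacent processes apart, so they would merge into one P run.

```c
#include <assert.h>
#include <stddef.h>

/* One list node from the figure: kind ('P' or 'H'), start, length. */
struct run {
    char kind;
    int start;
    int length;
};

/* Convert a bitmap string ('1' = hole, '0' = process) into runs.
 * Returns the number of runs written, or -1 if `out` is too small. */
static int bitmap_to_runs(const char *bits, struct run *out, int max_runs)
{
    int n = 0;

    assert(bits != NULL && out != NULL);
    for (int i = 0; bits[i] != '\0'; i++) {
        char kind = (bits[i] == '1') ? 'H' : 'P';

        if (n > 0 && out[n - 1].kind == kind) {
            out[n - 1].length++;          /* extend the current run */
        } else {
            if (n == max_runs)
                return -1;
            out[n].kind = kind;           /* start a new run */
            out[n].start = i;
            out[n].length = 1;
            n++;
        }
    }
    return n;
}
```

Running it on "0011000011100100" produces seven runs, starting with P 0 2 and H 2 2 and ending with P 14 2, the same list as on the slide.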

Page 53: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Buddy System

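[Figure: a 16-page region (page boundaries 0 to 16) drawn after each step of the sequence below, with blocks A, B, C, and D shown where they are allocated.]

1. Initialization
2. request A (2)
3. request B (1)
4. request C (2)
5. free A*
6. request D (1)
7. free B
8. free D
9. free C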

Page 54: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Request D (1)

[Figure: free_area bitmaps and free lists for orders 0 to 3, and mem_map for pages 0 to 15, at this step; blocks B and C are allocated.]

Page 55: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Free B

[Figure: free_area bitmaps and free lists for orders 0 to 3, and mem_map for pages 0 to 15, at this step; blocks B, D, and C are shown.]

Page 56: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Free D

[Figure: free_area bitmaps and free lists for orders 0 to 3, and mem_map for pages 0 to 15, at this step; blocks D and C are shown.]

Page 57: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Free C

[Figure: free_area bitmaps and free lists for orders 0 to 3, and mem_map for pages 0 to 15, at this step; block C is shown.]

Page 58: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Request 2

[Figure: free_area bitmaps (all bits 0) and free lists for orders 0 to 3, and mem_map for pages 0 to 15, before a request for 2 pages; the free lists hold blocks at pages 8 and 0.]

Page 59: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

[Figure: free_area bitmaps and free lists for orders 0 to 3, and mem_map for pages 0 to 15, after the request; the free lists hold blocks at pages 8, 4, and 2.]

Page 60: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

[Figure: memory after the kernel, in order: Kernel, pg1-pgn, swap cache (4 bytes per page), mem_map, free_area[].bitmap, then start_mem.]

typedef struct page {
    /* these must be first (free area handling) */
    struct page *next;
    struct page *prev;
    struct inode *inode;
    unsigned long offset;
    ………..
    atomic_t count;
    unsigned flags;
    unsigned dirty:16, age:8;
    ……...
    unsigned long map_nr;   /* page->map_nr == page - mem_map */
} mem_map_t;

Page 61: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

[Figure: initial state. free_area bitmaps for orders 0 to 3 with all bits 0, and mem_map for pages 0 to 15.]

Page 62: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

unsigned long free_area_init(unsigned long start_mem, unsigned long end_mem)
{
    /*
     * select nr of pages we try to keep free for important stuff
     * with a minimum of 48 pages. This is totally arbitrary
     */
    i = (end_mem - PAGE_OFFSET) >> (PAGE_SHIFT+7);
    if (i < 24)
        i = 24;
    i += 24;    /* The limit for buffer pages in __get_free_pages is
                 * decreased by 12+(i>>3) */
    min_free_pages = i;

    start_mem = init_swap_cache(start_mem, end_mem);
    mem_map = (mem_map_t *) start_mem;
    p = mem_map + MAP_NR(end_mem);
    start_mem = LONG_ALIGN((unsigned long) p);
    memset(mem_map, 0, start_mem - (unsigned long) mem_map);

Page 63: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

    do {
        --p;
        p->flags = (1 << PG_DMA) | (1 << PG_reserved);
        p->map_nr = p - mem_map;
    } while (p > mem_map);

    /* 6 */
    for (i = 0 ; i < NR_MEM_LISTS ; i++) {
        unsigned long bitmap_size;
        init_mem_queue(free_area+i);
        mask += mask;   /* mask *=2 */
        end_mem = (end_mem + ~mask) & mask;
        /* should be i+1 */
        bitmap_size = (end_mem - PAGE_OFFSET) >> (PAGE_SHIFT + i);
        bitmap_size = (bitmap_size + 7) >> 3;
        bitmap_size = LONG_ALIGN(bitmap_size);
        free_area[i].map = (unsigned int *) start_mem;
        memset((void *) start_mem, 0, bitmap_size);
        start_mem += bitmap_size;
    }
    return start_mem;
}
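The min_free_pages rule at the top of free_area_init is easy to evaluate by hand; the sketch below does the same arithmetic in standalone C. The function name and the `PAGE_SHIFT_4K` constant are assumptions made for this example (12 is the i386 page shift), and the argument stands for `end_mem - PAGE_OFFSET`, i.e. the amount of physical memory in bytes.

```c
#include <assert.h>

#define PAGE_SHIFT_4K 12   /* assumed: 4K pages */

/* One page kept free per 128 pages of memory, at least 24,
 * plus a fixed 24 on top (so the minimum is 48). */
static unsigned long min_free_pages_for(unsigned long mem_bytes)
{
    unsigned long i = mem_bytes >> (PAGE_SHIFT_4K + 7);

    if (i < 24)
        i = 24;
    i += 24;
    assert(i >= 48);
    return i;
}
```

A 4MB machine ends up with the minimum of 48 pages, a 16MB machine with 32 + 24 = 56 pages, and a 64MB machine with 128 + 24 = 152 pages.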

Page 64: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

trap_init

1. Set up interrupt routines
2. Int 0x80 for system call
3. Set up TSS and LDT in GDT for each task

Page 65: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

486 Exceptions

Vector  Type      Exception
0       Fault     Divide by Zero
1       Fault     Debug
…..
0B      Fault     Segment Not Present
…..
0D      Fault     General Protection
0E      Fault     Page Fault
…..
20-FF   Int/Trap  Used for OS

Page 66: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

void trap_init(void)
{
    set_call_gate(&default_ldt,lcall7);
    set_trap_gate(0,&divide_error);
    set_trap_gate(1,&debug);
    set_trap_gate(2,&nmi);
    set_system_gate(3,&int3);   /* int3-5 can be called from all */
    set_system_gate(4,&overflow);
    set_system_gate(5,&bounds);
    set_trap_gate(6,&invalid_op);
    set_trap_gate(7,&device_not_available);
    set_trap_gate(8,&double_fault);
    set_trap_gate(9,&coprocessor_segment_overrun);
    set_trap_gate(10,&invalid_TSS);
    set_trap_gate(11,&segment_not_present);
    set_trap_gate(12,&stack_segment);
    set_trap_gate(13,&general_protection);
    set_trap_gate(14,&page_fault);
    set_trap_gate(15,&spurious_interrupt_bug);
    set_trap_gate(16,&coprocessor_error);
    set_trap_gate(17,&alignment_check);

Page 67: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

    for (i=18;i<48;i++)
        set_trap_gate(i,&reserved);
    set_system_gate(0x80,&system_call);
    /* set up GDT task & ldt entries */
    p = gdt+FIRST_TSS_ENTRY;
    set_tss_desc(p, &init_task.tss);    /* init_task: hardwired task #0 */
    p++;
    set_ldt_desc(p, &default_ldt, 1);
    p++;

    for(i=1 ; i<NR_TASKS ; i++) {
        p->a=p->b=0;
        p++;
        p->a=p->b=0;
        p++;
    }

Page 68: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Macro                       Expansion
set_call_gate(a, addr)      _set_gate(a, 12, 3, addr)
set_trap_gate(n, addr)      _set_gate(&idt[n], 15, 0, addr)
set_system_gate(n, addr)    _set_gate(&idt[n], 15, 3, addr)
set_intr_gate(n, addr)      _set_gate(&idt[n], 14, 0, addr)

#define _set_gate(gate_addr,type,dpl,addr) \
__asm__ __volatile__ ("movw %%dx,%%ax\n\t" \
    "movw %2,%%dx\n\t" \
    "movl %%eax,%0\n\t" \
    "movl %%edx,%1" \
    :"=m" (*((long *) (gate_addr))), \
     "=m" (*(1+(long *) (gate_addr))) \
    :"i" ((short) (0x8000+(dpl<<13)+(type<<8))), \
     "d" ((char *) (addr)),"a" (KERNEL_CS << 16) \
    :"ax","dx")

Page 69: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Descriptor in IDT

Bits 31..0:   SEGMENT SELECTOR | OFFSET 15:0
Bits 63..32:  OFFSET 31:16 | P | DPL | 0 | TYPE | 000 | RESERVED
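What the _set_gate assembly computes can be written out in portable C. This is a sketch only: `gate` and `make_gate` are invented names, and the function merely returns the two 32-bit words instead of storing them into the IDT.

```c
#include <assert.h>
#include <stdint.h>

/* The two dwords of an IDT gate descriptor. */
struct gate {
    uint32_t low;    /* SEGMENT SELECTOR | OFFSET 15:0 */
    uint32_t high;   /* OFFSET 31:16 | P | DPL | 0 | TYPE | 0 */
};

/* Pack a gate the same way the _set_gate macro does:
 * 0x8000 sets the present bit, dpl goes to bits 14..13,
 * type to bits 11..8. */
static struct gate make_gate(uint16_t selector, uint32_t handler,
                             unsigned type, unsigned dpl)
{
    struct gate g;

    assert(type <= 15u && dpl <= 3u);
    g.low = ((uint32_t)selector << 16) | (handler & 0xFFFFu);
    g.high = (handler & 0xFFFF0000u)
           | (0x8000u + (dpl << 13) + (type << 8));
    return g;
}
```

An interrupt gate (type 14, DPL 0) therefore carries 0x8E00 in its low half of the high dword, a trap gate (type 15, DPL 0) 0x8F00, and a system gate (type 15, DPL 3) 0xEF00, which is what allows int 0x80 to be issued from user mode.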

Page 70: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

mem_init

• Reserve kernel and I/O pages

• Return all unused pages to buddy system

Page 71: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

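[Figure: physical memory at mem_init. Low memory: start_low_mem (near 4K) up to 0xA0000; reserved area from 0xA0000 to 0x100000; kernel code from 0x100000 up to end_text, followed by kernel data, pg1-pgn, swap_cache, mem_map, free_area[].map, and the Console, PCI & FS allocations; usable memory from start_mem up to high_mem.]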

Page 72: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

void mem_init(unsigned long start_mem, unsigned long end_mem)
{
    end_mem &= PAGE_MASK;
    high_memory = end_mem;

    /* mark usable pages in the mem_map[] */
    start_low_mem = PAGE_ALIGN(start_low_mem);

    start_mem = PAGE_ALIGN(start_mem);

    /*
     * IBM messed up *AGAIN* in their thinkpad: 0xA0000 -> 0x9F000.
     * They seem to have done something stupid with the floppy
     * controller as well..
     */
    while (start_low_mem < 0x9f000) {
        clear_bit(PG_reserved, &mem_map[MAP_NR(start_low_mem)].flags);
        start_low_mem += PAGE_SIZE;
    }

Page 73: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

    while (start_mem < high_memory) {
        clear_bit(PG_reserved, &mem_map[MAP_NR(start_mem)].flags);
        start_mem += PAGE_SIZE;
    }

    for (tmp = 0 ; tmp < high_memory ; tmp += PAGE_SIZE) {
        if (tmp >= MAX_DMA_ADDRESS)     /* 16M */
            clear_bit(PG_DMA, &mem_map[MAP_NR(tmp)].flags);
        if (PageReserved(mem_map+MAP_NR(tmp))) {
            if (tmp >= 0xA0000 && tmp < 0x100000)
                reservedpages++;
            else if (tmp < (unsigned long) &_etext)
                codepages++;
            else
                datapages++;
            continue;
        }
        mem_map[MAP_NR(tmp)].count = 1;

        free_page(tmp);
    }

Page 74: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

    tmp = nr_free_pages << PAGE_SHIFT;

    printk("Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data)\n",
        tmp >> 10,
        high_memory >> 10,
        codepages << (PAGE_SHIFT-10),
        reservedpages << (PAGE_SHIFT-10),
        datapages << (PAGE_SHIFT-10));

    return;
}

Page 75: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

#define free_page(addr) free_pages((addr),0)

void free_pages(unsigned long addr, unsigned long order)
{
    unsigned long map_nr = MAP_NR(addr);

    if (map_nr < MAP_NR(high_memory)) {
        mem_map_t * map = mem_map + map_nr;
        if (PageReserved(map))
            return;
        if (atomic_dec_and_test(&map->count)) {
            delete_from_swap_cache(map_nr);
            free_pages_ok(map_nr, order);
            return;
        }
    }
}

Page 76: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

static inline void free_pages_ok(unsigned long map_nr, unsigned long order)
{
    struct free_area_struct *area = free_area + order;
    unsigned long index = map_nr >> (1 + order);
    unsigned long mask = (~0UL) << order;

    cli();

#define list(x) (mem_map+(x))
    map_nr &= mask;

    nr_free_pages -= mask;      /* -mask = 1+~mask */
    while (mask + (1 << (NR_MEM_LISTS-1))) {
        if (!change_bit(index, area->map))
            break;
        remove_mem_queue(list(map_nr ^ -mask));     /* neighbor */
        mask <<= 1;
        area++;
        index >>= 1;
        map_nr &= mask;
    }
    add_mem_queue(area, list(map_nr));
#undef list
}
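The bit tricks in free_pages_ok can be isolated in a small model. The sketch below is a toy, not the kernel code: it uses a hypothetical 16-page region with 4 orders, one byte per bitmap bit, and no free lists. `buddy_of` corresponds to `map_nr ^ -mask`, `pair_index` to `map_nr >> (1 + order)`, and `free_block` follows the same toggle-and-merge loop, returning the order at which the freed block finally rests.

```c
#include <assert.h>

#define NR_ORDERS 4            /* hypothetical: orders 0..3, 16 pages */
#define NR_PAGES  16UL

/* One "bit" per pair of buddies at each order (a byte each, for clarity).
 * A bit is 1 when exactly one block of the pair is free. */
struct toy_buddy {
    unsigned char pair_bit[NR_ORDERS][NR_PAGES / 2];
};

/* The buddy of a block differs from it only in bit `order`. */
static unsigned long buddy_of(unsigned long map_nr, unsigned order)
{
    return map_nr ^ (1UL << order);
}

/* Both buddies of a pair share one bitmap bit. */
static unsigned long pair_index(unsigned long map_nr, unsigned order)
{
    return map_nr >> (1 + order);
}

/* Free a block; merge upward while the buddy is already free.
 * On return *map_nr is the start of the merged block. */
static unsigned free_block(struct toy_buddy *b, unsigned long *map_nr,
                           unsigned order)
{
    assert(order < NR_ORDERS && *map_nr < NR_PAGES);
    *map_nr &= ~0UL << order;
    while (order < NR_ORDERS - 1) {
        unsigned long idx = pair_index(*map_nr, order);
        unsigned char old = b->pair_bit[order][idx];

        b->pair_bit[order][idx] = old ^ 1;   /* like change_bit() */
        if (!old)
            break;                           /* buddy is still in use */
        order++;                             /* merge with the buddy */
        *map_nr &= ~0UL << order;
    }
    return order;
}
```

Starting from an all-allocated region, freeing page 0 leaves a single order-0 block; freeing page 1 then merges the pair into an order-1 block at page 0; freeing the order-1 block at page 2 merges again into an order-2 block at page 0.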

Page 77: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

extern inline unsigned long get_free_page(int priority)
{
    unsigned long page;

    page = __get_free_page(priority);
    if (page)
        memset((void *) page, 0, PAGE_SIZE);
    return page;
}

#define __get_free_page(priority) __get_free_pages((priority),0,0)

Page 78: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

unsigned long __get_free_pages(int priority, unsigned long order, int dma)
{
    unsigned long flags;
    int reserved_pages;

    if (order >= NR_MEM_LISTS)
        return 0;
    if (intr_count && priority != GFP_ATOMIC) {
        static int count = 0;
        if (++count < 5) {
            printk("gfp called nonatomically from interrupt %p\n",
                __builtin_return_address(0));
            priority = GFP_ATOMIC;
        }
    }
    reserved_pages = 5;
    if (priority != GFP_NFS)
        reserved_pages = min_free_pages;
    if ((priority == GFP_BUFFER || priority == GFP_IO) && reserved_pages >= 48)
        reserved_pages -= (12 + (reserved_pages>>3));
    save_flags(flags);

Page 79: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

repeat:
    cli();
    if ((priority==GFP_ATOMIC) || nr_free_pages > reserved_pages) {
        RMQUEUE(order, dma);
        restore_flags(flags);
        return 0;
    }
    restore_flags(flags);
    if (priority != GFP_BUFFER && try_to_free_page(priority, dma, 1))
        goto repeat;
    return 0;
}

Page 80: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

/*
 * Some ugly macros to speed up __get_free_pages()..
 */
#define MARK_USED(index, order, area) \
    change_bit((index) >> (1+(order)), (area)->map)
#define CAN_DMA(x) (PageDMA(x))
#define ADDRESS(x) (PAGE_OFFSET + ((x) << PAGE_SHIFT))

Page 81: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

#define RMQUEUE(order, dma) \
do { struct free_area_struct * area = free_area+order; \
     unsigned long new_order = order; \
    do { struct page *prev = memory_head(area), *ret; \
        while (memory_head(area) != (ret = prev->next)) { \
            if (!dma || CAN_DMA(ret)) { \
                unsigned long map_nr = ret->map_nr; \
                (prev->next = ret->next)->prev = prev; \
                MARK_USED(map_nr, new_order, area); \
                nr_free_pages -= 1 << order; \
                EXPAND(ret, map_nr, order, new_order, area); \
                restore_flags(flags); \
                return ADDRESS(map_nr); \
            } \
            prev = ret; \
        } \
        new_order++; area++; \
    } while (new_order < NR_MEM_LISTS); \
} while (0)

Page 82: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

#define EXPAND(map,index,low,high,area) \
do { unsigned long size = 1 << high; \
    while (high > low) { \
        area--; high--; size >>= 1; \
        add_mem_queue(area, map); \
        MARK_USED(index, high, area); \
        index += size; \
        map += size; \
    } \
    map->count = 1; \
    map->age = PAGE_INITIAL_AGE; \
} while (0)
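The splitting done by EXPAND can be traced with a small standalone function. This is an illustrative model under stated assumptions: `piece` and `split_block` are hypothetical names, the free lists and bitmaps are replaced by an output array, and the caller must supply room for at least `high - low` entries.

```c
#include <assert.h>

/* A block handed back to a free list: first page and order. */
struct piece {
    unsigned long index;
    unsigned order;
};

/* Split a free block of order `high` starting at page `index` down to
 * order `low`. Each step returns the lower half to the free lists
 * (recorded in `freed`) and continues with the upper half.
 * Returns the first page of the block finally handed to the caller. */
static unsigned long split_block(unsigned long index, unsigned high,
                                 unsigned low, struct piece *freed,
                                 unsigned *nfreed)
{
    unsigned long size = 1UL << high;

    assert(low <= high);
    *nfreed = 0;
    while (high > low) {
        high--;
        size >>= 1;
        freed[*nfreed].index = index;   /* lower half stays free */
        freed[*nfreed].order = high;
        (*nfreed)++;
        index += size;                  /* keep splitting the upper half */
    }
    return index;
}
```

Requesting one page from an 8-page block at page 0 leaves pages 0-3 (order 2), 4-5 (order 1), and 6 (order 0) on the free lists and hands out page 7, which matches the way `index += size` advances in the macro.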

Page 83: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

kernel_thread

call sys_clone();

if (StackIsChanged() /* new process */) {
    call fn(args);
    sys_exit();
} else {
    /* do nothing */
    /* task[0] goes through here */
}

[Figure: task[0] continues with cpu_idle() -> sys_idle() -> schedule().]

Page 84: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

static inline pid_t kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
    long retval;

    __asm__ __volatile__(
        "movl %%esp,%%esi\n\t"
        "int $0x80\n\t"         /* Linux/i386 system call */
        "cmpl %%esp,%%esi\n\t"  /* child or parent? */
        "je 1f\n\t"             /* parent - jump */
        "pushl %3\n\t"          /* push argument */
        "call *%4\n\t"          /* call fn */
        "movl %2,%0\n\t"        /* exit */
        "int $0x80\n"
        "1:\t"
        :"=a" (retval)
        :"0" (__NR_clone), "i" (__NR_exit),
         "r" (arg), "r" (fn),
         "b" (flags | CLONE_VM)
        :"si");
    return retval;
}

Page 85: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

System Calls

/*
 * This file contains the system call numbers. (unistd.h)
 */

#define __NR_setup                  0   /* used only by init, to get system going */
#define __NR_exit                   1
#define __NR_fork                   2
#define __NR_read                   3
#define __NR_write                  4
#define __NR_open                   5
……..
#define __NR_clone                  120
……..
#define __NR_sched_rr_get_interval  161
#define __NR_nanosleep              162
#define __NR_mremap                 163

Page 86: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

.data   /* entry.S */
ENTRY(sys_call_table)
        .long SYMBOL_NAME(sys_setup)            /* 0 */
        .long SYMBOL_NAME(sys_exit)
        .long SYMBOL_NAME(sys_fork)
        .long SYMBOL_NAME(sys_read)
        .long SYMBOL_NAME(sys_write)
        .long SYMBOL_NAME(sys_open)             /* 5 */
……..
        .long SYMBOL_NAME(sys_clone)            /* 120 */
……..
        .long SYMBOL_NAME(sys_sched_rr_get_interval)
        .long SYMBOL_NAME(sys_nanosleep)
        .long SYMBOL_NAME(sys_mremap)
        .long 0,0
        .long SYMBOL_NAME(sys_vm86)
        .space (NR_syscalls-166)*4              /* 256 */

Page 87: Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang

Pseudo Code for System Call

if (sys_call_num >= NR_syscalls)
    return -ENOSYS;
else {
    if (sys_call_table[sys_call_num]==NULL)
        return -ENOSYS;
    if (PF_TRACESYS) {
        syscall_trace();
        call sys_call_table[sys_call_num];
        syscall_trace();
    } else
        call sys_call_table[sys_call_num];
}
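The dispatch logic above is a bounds check followed by an indirect call through a table of function pointers. A minimal user-space sketch in C follows; the table size, the `demo_double` entry, and the single-argument signature are all assumptions made for the example, and tracing is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_DEMO_SYSCALLS 8          /* hypothetical table size */

typedef long (*syscall_fn)(long arg);

/* Hypothetical "system call" used only to populate the table. */
static long demo_double(long arg)
{
    return 2 * arg;
}

/* Unlisted slots are NULL, like the unused entries of sys_call_table. */
static const syscall_fn demo_table[NR_DEMO_SYSCALLS] = {
    [1] = demo_double,
};

/* Bounds check, NULL check, then the indirect call. */
static long dispatch(unsigned long nr, long arg)
{
    if (nr >= NR_DEMO_SYSCALLS)
        return -ENOSYS;
    if (demo_table[nr] == NULL)
        return -ENOSYS;
    return demo_table[nr](arg);
}
```

Here `dispatch(1, 21)` returns 42, while an empty slot or an out-of-range number returns -ENOSYS, the same two failure cases the pseudo code distinguishes.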

ENTRY(system_call)
    pushl %eax              # save orig_eax, for syscall_trace (strace)
    SAVE_ALL

Stack after SAVE_ALL (offsets in hex):

     0(%esp) - %ebx         # SAVE_ALL
     4(%esp) - %ecx
     8(%esp) - %edx
     C(%esp) - %esi
    10(%esp) - %edi
    14(%esp) - %ebp
    18(%esp) - %eax
    1C(%esp) - %ds
    20(%esp) - %es
    24(%esp) - %fs
    28(%esp) - %gs
    2C(%esp) - orig_eax     # pushl %eax
    30(%esp) - %eip         # pushed by CPU, int 0x80
    34(%esp) - %cs
    38(%esp) - %eflags
    3C(%esp) - %oldesp      # pushed by CPU, stack switching
    40(%esp) - %oldss

    movl $-ENOSYS,EAX(%esp)
    cmpl $(NR_syscalls),%eax            # EAX=SYS_CALL_NUM
    jae ret_from_sys_call
    movl SYMBOL_NAME(sys_call_table)(,%eax,4),%eax
    testl %eax,%eax
    je ret_from_sys_call
    ……..
    testb $0x20,flags(%ebx)             # PF_TRACESYS
    jne 1f
    call *%eax
    movl %eax,EAX(%esp)                 # save the return value
    jmp ret_from_sys_call
    ALIGN
1:  call SYMBOL_NAME(syscall_trace)
    movl ORIG_EAX(%esp),%eax
    call SYMBOL_NAME(sys_call_table)(,%eax,4)
    movl %eax,EAX(%esp)                 # save the return value
    call SYMBOL_NAME(syscall_trace)

sys_clone

asmlinkage int sys_clone(struct pt_regs regs)
{
    unsigned long clone_flags;
    unsigned long newsp;

    clone_flags = regs.ebx;
    newsp = regs.ecx;
    if (!newsp)
        newsp = regs.esp;
    return do_fork(clone_flags, newsp, &regs);
}

do_fork

• Copy process structure from parent

int do_fork(unsigned long clone_flags, unsigned long usp, struct pt_regs *regs)
{
    int nr;
    int error = -ENOMEM;
    unsigned long new_stack;
    struct task_struct *p;

    p = (struct task_struct *) kmalloc(sizeof(*p), GFP_KERNEL);
    if (!p)
        goto bad_fork;
    new_stack = alloc_kernel_stack();   /* get_free_page(GFP_KERNEL) */
    if (!new_stack)
        goto bad_fork_free_p;
    error = -EAGAIN;
    nr = find_empty_process();
    if (nr < 0)
        goto bad_fork_free_stack;

    *p = *current;

    if (p->exec_domain && p->exec_domain->use_count)
        (*p->exec_domain->use_count)++;
    if (p->binfmt && p->binfmt->use_count)
        (*p->binfmt->use_count)++;

    p->did_exec = 0;
    p->swappable = 0;
    p->kernel_stack_page = new_stack;
    *(unsigned long *) p->kernel_stack_page = STACK_MAGIC;
    p->state = TASK_UNINTERRUPTIBLE;
    p->flags &= ~(PF_PTRACED|PF_TRACESYS|PF_SUPERPRIV);
    p->flags |= PF_FORKNOEXEC;
    p->pid = get_pid(clone_flags);
    p->next_run = NULL;
    p->prev_run = NULL;
    p->p_pptr = p->p_opptr = current;
    p->p_cptr = NULL;
    init_waitqueue(&p->wait_chldexit);
    p->signal = 0;

    p->it_real_value = p->it_virt_value = p->it_prof_value = 0;
    p->it_real_incr = p->it_virt_incr = p->it_prof_incr = 0;
    init_timer(&p->real_timer);
    p->real_timer.data = (unsigned long) p;
    p->leader = 0;          /* session leadership doesn't inherit */
    p->tty_old_pgrp = 0;
    p->utime = p->stime = 0;
    p->cutime = p->cstime = 0;

    p->start_time = jiffies;
    task[nr] = p;
    SET_LINKS(p);
    nr_tasks++;

    error = -ENOMEM;
    /* copy all the process information */
    if (copy_files(clone_flags, p))
        goto bad_fork_cleanup;
    if (copy_fs(clone_flags, p))
        goto bad_fork_cleanup_files;
    if (copy_sighand(clone_flags, p))
        goto bad_fork_cleanup_fs;
    if (copy_mm(clone_flags, p))
        goto bad_fork_cleanup_sighand;
    copy_thread(nr, clone_flags, usp, p, regs);
    p->semundo = NULL;

    /* ok, now we should be set up.. */
    p->swappable = 1;
    p->exit_signal = clone_flags & CSIGNAL;
    p->counter = (current->counter >>= 1);
    wake_up_process(p);     /* state=TASK_RUNNING insert into run_queue */
    ++total_forks;
    return p->pid;

    /* error handler */
}
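One detail of do_fork that is easy to miss is the line `p->counter = (current->counter >>= 1);`. It halves the parent's remaining time slice and hands the child the same halved value, so the two tasks together never hold more ticks than the parent had before the fork. The sketch below isolates just that step; `struct demo_task` and `split_timeslice` are hypothetical names, not kernel definitions.

```c
/* Hypothetical stand-in for the one task_struct field that matters here. */
struct demo_task {
    long counter;   /* remaining time slice, in timer ticks */
};

/* Halve the parent's remaining ticks in place and give the child
 * the same halved amount, as the do_fork line does. */
void split_timeslice(struct demo_task *parent, struct demo_task *child)
{
    child->counter = (parent->counter >>= 1);
}
```

With an odd value such as 7 ticks, both tasks end up with 3, so one tick is simply dropped.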

Process’s Virtual Memory

[Figure: task_struct.mm points to an mm_struct (count, pgd, mmap, mmap_avl, mmap_sem). Its mmap field heads a list of vm_area_struct entries (vm_end, vm_start, vm_flags, vm_inode, vm_ops, vm_next) covering the code and data areas. vm_ops points to the area's operations: nopage, wppage, swapout, ….]

struct mm_struct {
    int count;
    pgd_t * pgd;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack, start_mmap;
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long rss, total_vm, locked_vm;
    unsigned long def_flags;
    struct vm_area_struct * mmap;
    struct vm_area_struct * mmap_avl;
    struct semaphore mmap_sem;
};

#define INIT_MM { \
        1, \
        swapper_pg_dir, \
        0, 0, 0, 0, \
        0, 0, 0, 0, \
        0, 0, 0, 0, \
        0, 0, 0, \
        0, \
        &init_mmap, &init_mmap, MUTEX }

struct vm_area_struct {
    struct mm_struct * vm_mm;   /* VM area parameters */
    unsigned long vm_start;
    unsigned long vm_end;
    pgprot_t vm_page_prot;
    unsigned short vm_flags;
    /* AVL tree of VM areas per task, sorted by address */
    short vm_avl_height;
    struct vm_area_struct * vm_avl_left;
    struct vm_area_struct * vm_avl_right;
    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct * vm_next;
    /* more */
    struct vm_operations_struct * vm_ops;
    unsigned long vm_offset;
    struct inode * vm_inode;
    unsigned long vm_pte;       /* shared mem */
};

#define INIT_MMAP { &init_mm, 0, 0x40000000, PAGE_SHARED, VM_READ | VM_WRITE | VM_EXEC }
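Because the areas are kept sorted by address, finding the area for a given address reduces to finding the first one whose vm_end lies above it. The following is a simplified sketch that walks only the vm_next list; the kernel also keeps the AVL tree declared in the struct above for faster lookups. `struct demo_vma`, `demo_find_vma` and `demo_addr_mapped` are hypothetical names for illustration.

```c
#include <stddef.h>

/* Hypothetical, cut-down version of vm_area_struct. */
struct demo_vma {
    unsigned long vm_start;     /* first address inside the area */
    unsigned long vm_end;       /* first address past the area */
    struct demo_vma *vm_next;   /* next area, sorted by address */
};

/* Return the first area with addr < vm_end, or NULL if there is none.
 * The area returned may still begin above addr (addr falls in a hole),
 * so the caller has to compare against vm_start as well. */
struct demo_vma *demo_find_vma(struct demo_vma *head, unsigned long addr)
{
    for (struct demo_vma *v = head; v != NULL; v = v->vm_next)
        if (addr < v->vm_end)
            return v;
    return NULL;
}

/* True only when addr really lies inside some area. */
int demo_addr_mapped(struct demo_vma *head, unsigned long addr)
{
    struct demo_vma *v = demo_find_vma(head, addr);
    return v != NULL && v->vm_start <= addr;
}
```

The two-step check (lookup, then a vm_start comparison) is the same shape as the `find_vma` call followed by `vma->vm_start <= address` in the page fault handler.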

copy_thread

Copy TSS from parent and set some private fields

void copy_thread(int nr, unsigned long clone_flags, unsigned long esp,
                 struct task_struct * p, struct pt_regs * regs)
{
    int i;
    struct pt_regs * childregs;

    p->tss.es = KERNEL_DS;
    p->tss.cs = KERNEL_CS;
    p->tss.ss = KERNEL_DS;
    p->tss.ds = KERNEL_DS;
    p->tss.fs = USER_DS;
    p->tss.gs = KERNEL_DS;
    p->tss.ss0 = KERNEL_DS;
    p->tss.esp0 = p->kernel_stack_page + PAGE_SIZE;
    p->tss.tr = _TSS(nr);
    childregs = ((struct pt_regs *) (p->kernel_stack_page + PAGE_SIZE)) - 1;
    p->tss.esp = (unsigned long) childregs;
    p->tss.eip = (unsigned long) ret_from_sys_call;
    *childregs = *regs;

    childregs->eax = 0;
    childregs->esp = esp;
    p->tss.back_link = 0;
    p->tss.eflags = regs->eflags & 0xffffcfff;  /* iopl is always 0 for a new process */
    p->tss.ldt = _LDT(nr);
    set_tss_desc(gdt+(nr<<1)+FIRST_TSS_ENTRY,&(p->tss));

    p->tss.bitmap = offsetof(struct thread_struct,io_bitmap);
    for (i = 0; i < IO_BITMAP_SIZE+1 ; i++)     /* IO bitmap is actually SIZE+1 */
        p->tss.io_bitmap[i] = ~0;
}
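The childregs arithmetic in copy_thread places a copy of the parent's saved registers at the very top of the child's kernel stack page and then zeroes the saved eax, which is why the new task sees a return value of 0. Below is a user-space sketch of only that step. `struct demo_regs`, `DEMO_PAGE_SIZE` and `demo_child_frame` are hypothetical, and the register set is cut down to a few fields.

```c
#define DEMO_PAGE_SIZE 4096     /* assumed page size for the sketch */

/* Hypothetical, cut-down register frame (the real pt_regs has more fields). */
struct demo_regs {
    long ebx, ecx, edx, eax, esp;
};

/* Copy the parent's frame so that it ends exactly at the top of the
 * child's stack page, then patch the two fields that must differ. */
struct demo_regs *demo_child_frame(unsigned char *stack_page,
                                   const struct demo_regs *parent,
                                   long child_esp)
{
    struct demo_regs *child =
        ((struct demo_regs *)(stack_page + DEMO_PAGE_SIZE)) - 1;

    *child = *parent;
    child->eax = 0;             /* the child's system call returns 0 */
    child->esp = child_esp;     /* the child gets its own user stack pointer */
    return child;
}
```

The caller is expected to pass a page that is suitably aligned for `struct demo_regs`; the parent's frame is left untouched, so the parent still gets its own return value.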

ret_from_sys_call

• All slow interrupts and system calls end here

ret_from_sys_call:
    cmpl $0,SYMBOL_NAME(intr_count)     /* handle interrupts */
    jne 2f
9:  movl SYMBOL_NAME(bh_mask),%eax
    andl SYMBOL_NAME(bh_active),%eax
    jne handle_bottom_half

1:  sti
    cmpl $0,SYMBOL_NAME(need_resched)   /* to see if we need reschedule */
    jne reschedule
    ………….

2:  RESTORE_ALL

#define RESTORE_ALL \
    …………..
    popl %ebx; \
    popl %ecx; \
    popl %edx; \
    popl %esi; \
    popl %edi; \
    popl %ebp; \
    popl %eax; \
    pop %ds; \
    pop %es; \
    pop %fs; \
    pop %gs; \
    addl $4,%esp; \
    iret

schedule

• task->counter: dynamic priority

• task->priority: static priority

• timer interrupt (100 Hz):

    jiffies++
    if (current->counter <= 0)
        need_resched = 1;

• run queue: links all RUNNABLE tasks
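The timer bullet can be turned into a small simulation. This is a sketch under one explicit assumption that the slide does not show: each tick also takes one unit off the running task's counter. All names (`struct demo_task`, `demo_jiffies`, `demo_need_resched`, `demo_timer_tick`) are hypothetical.

```c
/* Hypothetical stand-in for the scheduling field of a task. */
struct demo_task {
    long counter;   /* dynamic priority: ticks left in this time slice */
};

unsigned long demo_jiffies;     /* ticks since start */
int demo_need_resched;          /* set when the scheduler should run */

/* One timer interrupt for the task that is currently running. */
void demo_timer_tick(struct demo_task *running)
{
    demo_jiffies++;
    if (running->counter > 0)
        running->counter--;     /* assumption: one tick is charged per interrupt */
    if (running->counter <= 0)
        demo_need_resched = 1;  /* time slice used up: ask for a reschedule */
}
```

At 100 Hz a task that starts with a counter of 20 would, under this assumption, run for about 200 ms before the flag is raised.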

asmlinkage void schedule(void)
{
    int c;
    struct task_struct * p;
    struct task_struct * prev, * next;
    unsigned long timeout = 0;

    /* check alarm, wake up any interruptible tasks that have got a signal */

    allow_interrupts();

    if (intr_count)
        goto scheduling_in_interrupt;

    if (bh_active & bh_mask) {
        intr_count = 1;
        do_bottom_half();
        intr_count = 0;
    }

    need_resched = 0;
    prev = current;
    cli();
    /* move an exhausted RR process to be last.. */
    if (!prev->counter && prev->policy == SCHED_RR) {
        prev->counter = prev->priority;
        move_last_runqueue(prev);
    }
    ………….
    p = init_task.next_run;
    sti();
    c = -1000;
    next = idle_task;
    while (p != &init_task) {
        int weight = goodness(p, prev, this_cpu);
        if (weight > c)
            c = weight, next = p;
        p = p->next_run;
    }

    /* if all runnable processes have "counter == 0", re-calculate counters */
    if (!c) {
        for_each_task(p)
            p->counter = (p->counter >> 1) + p->priority;
    }
    if (prev != next) {
        kstat.context_swtch++;
        …………..
        switch_to(prev,next);
    }
    return;
}

#define switch_to(prev,next) do { \
__asm__("movl %2,"SYMBOL_NAME_STR(current_set)"\n\t" \
        "ljmp %0\n\t" \
        ……………..
        : /* no outputs */ \
        :"m" (*(((char *)&next->tss.tr)-4)), \
         "r" (prev), "r" (next)); \
} while (0)
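The selection loop and the counter recalculation in schedule() can be tried out in isolation. The sketch below makes two simplifications that should be read as assumptions: the run queue is a NULL-terminated list instead of a circular list through init_task, and `demo_goodness` returns just the counter, whereas the real goodness() takes more into account. All `demo_` names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical, cut-down task with only the scheduling fields. */
struct demo_task {
    long counter;               /* dynamic priority */
    long priority;              /* static priority */
    struct demo_task *next_run; /* next task on the run queue */
};

/* Simplified stand-in for goodness(): the remaining counter only. */
long demo_goodness(const struct demo_task *p)
{
    return p->counter;
}

/* Pick the runnable task with the highest weight; fall back to idle.
 * The best weight found is stored in *best (-1000 if the queue is empty). */
struct demo_task *demo_pick_next(struct demo_task *run_head,
                                 struct demo_task *idle,
                                 long *best)
{
    long c = -1000;
    struct demo_task *next = idle;

    for (struct demo_task *p = run_head; p != NULL; p = p->next_run) {
        long weight = demo_goodness(p);
        if (weight > c) {
            c = weight;
            next = p;
        }
    }
    *best = c;
    return next;
}

/* Refill every task: half of what is left plus the static priority. */
void demo_recalculate(struct demo_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tasks[i].counter = (tasks[i].counter >> 1) + tasks[i].priority;
}
```

A best weight of 0 means every runnable task has used up its slice, which is the condition under which schedule() recalculates the counters. Because the old counter is halved rather than discarded, a task that slept with ticks left comes back with a somewhat larger counter than one that ran its slice down to zero.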

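Process Switching

[Figure: process #1 issues int 0x80 and enters system_call, then ret_from_sys_call; need_resched is set, so schedule runs and calls switch_to. Process #2 resumes inside the kernel, returns to ret_from_sys_call, and leaves through iret.]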

Page Fault

When a page fault occurs:

• Stack (top first): error_code, EIP, CS, EFLAGS, old ESP, old SS

• error_code bits: U/S, W/R, P

• CR2: contains the fault address

• Jump to the interrupt handling routine for int 0x0E

ENTRY(page_fault)
    pushl $SYMBOL_NAME(do_page_fault)
    jmp error_code

Stack built by error_code (offsets in hex):

     0(%esp) - %ebx         # pushl …..
     4(%esp) - %ecx
     8(%esp) - %edx
     C(%esp) - %esi
    10(%esp) - %edi
    14(%esp) - %ebp
    18(%esp) - %eax
    1C(%esp) - %ds
    20(%esp) - %es
    24(%esp) - %fs
    28(%esp) - %gs          # initially holds the addr. of do_page_fault
    2C(%esp) - orig_eax     # error_code pushed by CPU
    30(%esp) - %eip         # pushed by CPU (exception 0x0E)
    34(%esp) - %cs
    38(%esp) - %eflags
    3C(%esp) - %oldesp      # pushed by CPU, stack switching
    40(%esp) - %oldss

error_code:
    push %fs
    push %es
    push %ds
    pushl %eax
    xorl %eax,%eax
    pushl %ebp
    pushl %edi
    pushl %esi
    pushl %edx
    decl %eax                       # eax = -1
    pushl %ecx
    pushl %ebx
    cld
    xorl %ebx,%ebx                  # zero ebx
    xchgl %eax, ORIG_EAX(%esp)      # orig_eax (get the error code.)
    mov %gs,%bx                     # get the lower order bits of gs
    movl %esp,%edx
    xchgl %ebx, GS(%esp)            # get the address and save gs.
    pushl %eax                      # push the error code (argument)
    pushl %edx
    movl $(KERNEL_DS),%edx
    mov %dx,%ds
    mov %dx,%es
    movl $(USER_DS),%edx
    mov %dx,%fs
    movl SYMBOL_NAME(current_set),%eax
    call *%ebx                      # call do_page_fault
    addl $8,%esp                    # make a similar stack as system call
    jmp ret_from_sys_call

do_page_fault

• This routine handles page faults. It determines the address, and the problem, and then passes it off to one of the appropriate routines.

• error_code:

    bit 0 == 0 means no page found, 1 means protection fault
    bit 1 == 0 means read, 1 means write
    bit 2 == 0 means kernel, 1 means user-mode
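The three error_code bits listed above can be unpacked with plain bit masks. This is a small illustrative sketch; the `DEMO_PF_*` constants, `struct demo_fault` and `demo_decode_fault` are hypothetical names chosen here, not kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical names for the three bits of the page fault error code. */
#define DEMO_PF_PROT   0x1UL   /* bit 0: 0 = no page found, 1 = protection fault */
#define DEMO_PF_WRITE  0x2UL   /* bit 1: 0 = read,          1 = write */
#define DEMO_PF_USER   0x4UL   /* bit 2: 0 = kernel,        1 = user mode */

struct demo_fault {
    bool protection;    /* page was present, access not allowed */
    bool write;         /* faulting access was a write */
    bool user;          /* fault happened in user mode */
};

/* Split an error code into its three flags. */
struct demo_fault demo_decode_fault(unsigned long error_code)
{
    struct demo_fault f;

    f.protection = (error_code & DEMO_PF_PROT) != 0;
    f.write      = (error_code & DEMO_PF_WRITE) != 0;
    f.user       = (error_code & DEMO_PF_USER) != 0;
    return f;
}
```

For example, an error code of 6 decodes to a user-mode write to a page that is not present.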

asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
    void (*handler)(struct task_struct *,
                    struct vm_area_struct *,
                    unsigned long,
                    int);
    struct task_struct *tsk = current;
    struct mm_struct *mm = tsk->mm;
    struct vm_area_struct * vma;
    ….

    /* get the address */
    __asm__("movl %%cr2,%0":"=r" (address));
    vma = find_vma(mm, address);
    if (!vma)
        goto bad_area;
    if (vma->vm_start <= address)
        goto good_area;
    …...

/*
 * Something tried to access memory that isn't in our memory map..
 * Fix it, but check if it's kernel or user first..
 */
bad_area:
    if (error_code & 4) {       /* user mode, kill it */
        tsk->tss.cr2 = address;
        tsk->tss.error_code = error_code;
        tsk->tss.trap_no = 14;
        force_sig(SIGSEGV, tsk);
        return;
    }
    …...
}

good_area:
    handler = do_no_page;
    switch (error_code & 3) {
        default:    /* 3: write, present */
            handler = do_wp_page;
            /* fall through */
        case 2:     /* write, not present */
            if (!(vma->vm_flags & VM_WRITE))
                goto bad_area;
            break;
        case 1:     /* read, present */
            goto bad_area;
        case 0:     /* read, not present */
            if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
                goto bad_area;
    }
    handler(tsk, vma, address, write);
    .…..
    return;

         not present                present

write    check if you can write     check if you can write
         do_no_page                 do_wp_page

read     check if you can read      bad_area
         do_no_page
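The table above is small enough to write out as a pure function and test case by case. The sketch mirrors the switch on `error_code & 3`. The `DEMO_VM_*` flag values, `enum demo_action` and `demo_classify` are hypothetical and exist only for this illustration.

```c
/* Hypothetical permission flags of the area that contains the address. */
#define DEMO_VM_READ   0x1u
#define DEMO_VM_WRITE  0x2u
#define DEMO_VM_EXEC   0x4u

enum demo_action {
    DEMO_BAD_AREA,      /* access not allowed */
    DEMO_NO_PAGE,       /* page missing: bring it in */
    DEMO_WP_PAGE        /* page present but write-protected */
};

/* Decide what to do for a fault inside a valid area.
 * Bit 0 of error_code: page present; bit 1: write access. */
enum demo_action demo_classify(unsigned long error_code, unsigned int vm_flags)
{
    switch (error_code & 3) {
    case 3:     /* write, present */
        return (vm_flags & DEMO_VM_WRITE) ? DEMO_WP_PAGE : DEMO_BAD_AREA;
    case 2:     /* write, not present */
        return (vm_flags & DEMO_VM_WRITE) ? DEMO_NO_PAGE : DEMO_BAD_AREA;
    case 1:     /* read, present */
        return DEMO_BAD_AREA;
    default:    /* 0: read, not present */
        return (vm_flags & (DEMO_VM_READ | DEMO_VM_EXEC))
                   ? DEMO_NO_PAGE : DEMO_BAD_AREA;
    }
}
```

Keeping the decision separate from the handlers makes each of the four cells of the table checkable on its own.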

do_no_page

1. Address is present in memory: just return.

2. Address is in the swap area: call do_swap_page to swap it in.

[Figure: cr3 → tsk page tables → page, brought in from disk]

3. If no nopage routine is defined in the vm_area_struct, get a free page and link it in (uninitialized data).

4. If a nopage routine is defined in the vm_area_struct, call it (file_mmap_nopage tries to share pages with other tasks).

[Figure: cr3 → tsk page tables → page obtained with get_free_page]

do_wp_page

1. Address not present: return.

2. Page is already PAGE_RW: return.

3. If the page is referenced by only one task (count==1), make it PAGE_RW.

4. If the page is referenced by more than one task, copy it to a new page and make the new page PAGE_RW.

[Figure: tsk1 and tsk each reach the same page through their own cr3; on a write the page is copied to a new page, which is set PAGE_RW.]
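The four do_wp_page cases amount to copy-on-write with a reference count. The following user-space sketch models them with heap-allocated pages; it is an illustration under simplifying assumptions, not the kernel routine. `struct demo_page`, `struct demo_pte`, `DEMO_PAGE_SIZE` and `demo_wp_fault` are hypothetical names, and `malloc` stands in for the kernel's page allocator.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096     /* assumed page size for the sketch */

/* Hypothetical physical page with a count of the tasks that map it. */
struct demo_page {
    int count;
    unsigned char data[DEMO_PAGE_SIZE];
};

/* Hypothetical page table entry of one task. */
struct demo_pte {
    struct demo_page *page;
    bool present;
    bool writable;
};

/* Handle a write to a write-protected page.
 * Returns 0 on success, -1 if no memory is left for the copy. */
int demo_wp_fault(struct demo_pte *pte)
{
    if (!pte->present)              /* case 1: nothing to do here */
        return 0;
    if (pte->writable)              /* case 2: already writable */
        return 0;
    if (pte->page->count == 1) {    /* case 3: sole user, just unprotect */
        pte->writable = true;
        return 0;
    }

    /* case 4: shared page, give this task a private copy */
    struct demo_page *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return -1;
    memcpy(copy->data, pte->page->data, DEMO_PAGE_SIZE);
    copy->count = 1;
    pte->page->count--;             /* one task fewer shares the original */
    pte->page = copy;
    pte->writable = true;
    return 0;
}
```

After the first task takes its private copy, the original page has a count of 1, so a later write fault from the remaining task falls into case 3 and costs no copy at all.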