Why can’t we do ‘raw’ I/O?

How the x86 stops user-programs from directly controlling devices, and how we devise a ‘workaround’

x86 Privilege Levels

• To let multiple users do multiple tasks in a manner that affords each some ‘protection’ against interference by others, any modern CPU will implement two or more separate levels of ‘privilege’ for its operations -- an ‘unrestricted privileges’ arena for the code in its Master Control Program (its ‘kernel’), and a ‘restricted privileges’ realm for code in users’ application programs

Four Privilege Rings

Ring 0: most-trusted level

Ring 1

Ring 2

Ring 3: least-trusted level

Suggested purposes

Ring0: operating system kernel

Ring1: operating system services

Ring2: custom extensions

Ring3: ordinary user applications

Unix/Linux and Windows

Ring0: operating system

Ring1: unused

Ring2: unused

Ring3: application programs

IOPL

• The Intel x86 processor includes a way to either allow or prohibit accesses to system peripheral devices by code that executes in the various ‘privilege rings’, by utilizing a 2-bit field within the x86 FLAGS register which controls whether or not ‘in’ and ‘out’ are allowed to execute – the field is known as the I/O Privilege Level field, and Linux normally sets its value to be zero

The x86 API registers

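General-purpose registers: RAX RBX RCX RDX RSI RDI RBP RSP R8 R9 R10 R11 R12 R13 R14 R15

Instruction pointer and flags: RIP RFLAGS

Segment registers: CS DS ES FS GS SS

(Figure: register set of the Intel Core-2 Quad processor)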

The FLAGS register

Bit:    14   13 12   11   10   9    8    7    6    5   4    3   2    1   0
Field:  NT   IOPL    OF   DF   IF   TF   SF   ZF   0   AF   0   PF   1   CF

Legend:
CF = Carry Flag, PF = Parity Flag, AF = Auxiliary Flag, ZF = Zero Flag, SF = Sign Flag, OF = Overflow Flag
TF = Trap Flag, IF = Interrupt Flag, DF = Direction Flag
IOPL = I/O Privilege Level, NT = Nested Task

Status-flags: CF, PF, AF, ZF, SF, OF

Control-flags: TF, IF, DF

‘seeflags.cpp’

• This demo-program allows us to view the settings of bits in the RFLAGS register – and the IOPL-field in particular (bits 13,12)

• When IOPL == 0, only ring0 code will be able to execute ‘in’ and ‘out’ instructions

• When IOPL == 3, then code executing in any of the rings will be able to execute I/O

• So – let’s change IOPL to 3 – but how?
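The IOPL rule above can be sketched in a few lines of C. This is only an illustration, not code from ‘seeflags.cpp’: the names `iopl_of` and `io_allowed` are hypothetical, and the check deliberately ignores the I/O permission bitmap that the CPU also consults when the current ring number exceeds IOPL.

```c
#include <assert.h>
#include <stdint.h>

#define IOPL_SHIFT 12                            /* IOPL occupies bits 13,12 */
#define IOPL_MASK  (UINT64_C(3) << IOPL_SHIFT)

/* Extract the 2-bit IOPL field from an image of the RFLAGS register. */
unsigned iopl_of(uint64_t rflags)
{
    return (unsigned)((rflags & IOPL_MASK) >> IOPL_SHIFT);
}

/* Simplified rule: 'in' and 'out' may execute when the current
 * ring number (0..3) does not exceed IOPL. */
int io_allowed(unsigned ring, uint64_t rflags)
{
    assert(ring <= 3);
    return ring <= iopl_of(rflags);
}
```

For a flags image such as 0x0202 the field is 0, so only ring0 passes the check; with 0x3202 the field is 3, so every ring passes.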

‘pushfq’/‘popfq’

• An idea suggested by the ‘inline’ assembly language in our ‘seeflags.cpp’ demo would be to just ‘pop’ a suitably designed value from the stack into the RFLAGS register

• But the CPU is not about to allow that if it’s currently executing ring3 code while IOPL is set to 0 – the ‘popfq’ still executes, but the IOPL bits are silently left unchanged, since changing them would compromise the system’s intended ‘protection’

Must do it from ring0!

• Our classroom’s Linux systems will allow us to install our own code-module, as an ‘add-on’ to the running kernel, and such code could therefore be executed without any restrictions (i.e., at ring0)

• This idea motivates us to explore briefly the programming ideas needed for writing our own LKM (Linux Kernel Module)

A module’s organization

my_info: the module’s ‘payload’ function

module_init, module_exit: the module’s two required administrative functions

Our ‘newproc.cpp’ utility

• For the type of LKM that creates a pseudo-file in the ‘/proc’ directory, there is a ‘skeleton’ of C-language code we can start from, and then add our own specific functionality to that skeleton-code

• You can quickly create this ‘skeleton’ file by using our ‘newproc.cpp’ utility-program

Software interrupts

• One way for a user-program, which normally executes in ring3, to switch to ring0 (if it’s allowed) is by using a ‘software interrupt’

• This is how the 32-bit version of Linux did its various system-calls, with ‘int $0x80’

• We can craft an LKM whose ‘payload’ is an interrupt service routine that would be able to change the IOPL from 0 to 3

Systems programming

• To accomplish this design-idea, we’ll need an understanding of our CPU’s interrupt mechanism, including some special data-structures located in kernel memory and some special CPU registers which allow the CPU to locate those data-structures

Descriptor Tables

IDTR locates the IDT: Interrupt Descriptor Table (256 Gate Descriptors)

GDTR locates the GDT: Global Descriptor Table (Segment Descriptors)

IDTR and GDTR are special processor registers used by the CPU for locating its Descriptor Tables within the system’s memory

IDT Descriptor-format

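Each gate descriptor occupies 16 bytes, shown here as four 32-bit rows (high bits on the left):

row 3 (bytes 12..15):  reserved (=0)
row 2 (bytes 8..11):   offset 63..32
row 1 (bytes 4..7):    offset 31..16 | P | DPL | 0 | gate type | 00000 | IST
row 0 (bytes 0..3):    segment selector | offset 15..0

LEGEND:
segment-selector (for the handler’s code-segment)
offset within code-segment to handler’s entry-point
gate-type (0xE = Interrupt Gate, 0xF = Trap Gate)
IST = Interrupt Stack Table (0..7)
DPL = Descriptor Privilege Level (0..3)
P = Present (1 = yes, 0 = no)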
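The descriptor format above maps naturally onto a 16-byte C structure. The following is a minimal sketch, not the code used in ‘iokludge.c’: the type name `gate_desc64`, the helper `make_gate`, and any selector or handler values passed to it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit-mode gate descriptor, laid out from byte 0 upward. */
typedef struct {
    uint16_t offset_lo;    /* offset 15..0                       */
    uint16_t selector;     /* handler's code-segment selector    */
    uint8_t  ist;          /* bits 2..0 = IST, bits 7..3 = 0     */
    uint8_t  type_attr;    /* P | DPL (2 bits) | 0 | gate type   */
    uint16_t offset_mid;   /* offset 31..16                      */
    uint32_t offset_hi;    /* offset 63..32                      */
    uint32_t reserved;     /* must be 0                          */
} gate_desc64;

_Static_assert(sizeof(gate_desc64) == 16, "a gate descriptor is 16 bytes");

/* Build a 'present' gate for the given handler address. */
gate_desc64 make_gate(uint64_t handler, uint16_t selector,
                      uint8_t gate_type, uint8_t dpl, uint8_t ist)
{
    gate_desc64 g;
    assert(dpl <= 3 && ist <= 7 && gate_type <= 0xF);
    g.offset_lo  = (uint16_t)(handler & 0xFFFF);
    g.selector   = selector;
    g.ist        = ist;
    g.type_attr  = (uint8_t)(0x80 | (dpl << 5) | gate_type);  /* P = 1 */
    g.offset_mid = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_hi  = (uint32_t)(handler >> 32);
    g.reserved   = 0;
    return g;
}
```

A gate that ring3 code may invoke with a software interrupt needs DPL = 3, so an interrupt gate (type 0xE) of that kind has the attribute byte 0xEE.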

IDTR register-format

IDTR:  Base-Address of the IDT segment (64-bits) | segment limit (16-bits)   = 80-bits

Special processor instructions are used to ‘load’ this 10-byte register from a memory-image (‘LIDT’), or to ‘store’ this register’s value (‘SIDT’)

The ‘LIDT’ instruction can only be executed by code running in Ring0, but the ‘SIDT’ can be executed by code running at any privilege level.
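The 10-byte memory-image that ‘LIDT’ loads can be built by hand. The sketch below assumes the standard layout (16-bit limit in bytes 0..1, 64-bit base in bytes 2..9, little-endian) and that the limit holds the table size in bytes minus one; the function names and the base address used in any call are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { IDTR_IMAGE_BYTES = 10, GATE_BYTES = 16 };

/* Fill a 10-byte IDTR image: limit first, then the 64-bit base address. */
void make_idtr_image(unsigned char image[IDTR_IMAGE_BYTES],
                     uint64_t base, unsigned gates)
{
    assert(gates >= 1 && gates <= 256);
    uint16_t limit = (uint16_t)(gates * GATE_BYTES - 1);  /* size - 1 */
    image[0] = (unsigned char)(limit & 0xFF);
    image[1] = (unsigned char)(limit >> 8);
    for (int i = 0; i < 8; i++)
        image[2 + i] = (unsigned char)(base >> (8 * i));
}

/* Read the two fields back out of an image (e.g. one stored by SIDT). */
uint16_t idtr_limit(const unsigned char image[IDTR_IMAGE_BYTES])
{
    return (uint16_t)(image[0] | (image[1] << 8));
}

uint64_t idtr_base(const unsigned char image[IDTR_IMAGE_BYTES])
{
    uint64_t base = 0;
    for (int i = 7; i >= 0; i--)
        base = (base << 8) | image[2 + i];
    return base;
}
```

With 256 gates of 16 bytes each the limit is 0x0FFF, so a complete IDT fits exactly in one 4096-byte page.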

Stack layout after an interrupt

Ring0 stack (each entry is 64-bits):

32(%rsp)   SS
24(%rsp)   RSP
16(%rsp)   RFLAGS
 8(%rsp)   CS
 0(%rsp)   RIP   <- RSP0

Our interrupt-9 handler

//-------------------- INTERRUPT SERVICE ROUTINE -----------------
void isr_entry( void );
asm("   .text                           ");
asm("   .type   isr_entry, @function    ");
asm("isr_entry:                         ");
asm("   orq     $0x3000, 16(%rsp)       ");  // set IOPL=3 in the saved RFLAGS image
asm("   iretq                           ");  // resume the interrupted program
//----------------------------------------------------------------

Our ‘iokludge.c’ kernel module uses this ‘inline’ assembly language to generate the machine-code for handling an interrupt-9, which merely sets the IOPL-field (in the saved image of the RFLAGS register) to 3, and then resumes execution of the interrupted application program.
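The effect of that ‘orq’ instruction can be modeled in ordinary C by treating the five saved quadwords as a structure. This is a user-space illustration only, with a hypothetical struct name and helper; it confirms that offset 16 is the saved RFLAGS image and that OR-ing in 0x3000 sets both IOPL bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The five quadwords the CPU pushes on the ring0 stack, lowest address first. */
struct iret_frame {
    uint64_t rip;      /*  0(%rsp) */
    uint64_t cs;       /*  8(%rsp) */
    uint64_t rflags;   /* 16(%rsp) */
    uint64_t rsp;      /* 24(%rsp) */
    uint64_t ss;       /* 32(%rsp) */
};

_Static_assert(offsetof(struct iret_frame, rflags) == 16,
               "saved RFLAGS sits at 16(%rsp)");

/* C equivalent of: orq $0x3000, 16(%rsp) */
void grant_iopl3(struct iret_frame *frame)
{
    assert(frame != NULL);
    frame->rflags |= UINT64_C(0x3000);   /* bits 13,12 := 11 binary */
}
```

Because only the saved image is changed, the new IOPL takes effect when ‘iretq’ reloads RFLAGS from the stack on the way back to ring3.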

Core-2 Quad system

[Figure: an Intel Core-2 Quad processor containing CPU0, CPU1, CPU2 and CPU3, attached via the system bus to system memory and to five I/O devices]

‘smp_call_function()’

• This Linux kernel ‘helper’ routine allows a CPU to request all other CPUs to execute a specified subroutine of type: void function( void *info );

• In our current Linux kernel (vers. 2.6.26.6) this helper-routine takes four arguments:
  – The address of the subroutine’s entry-point
  – The address of data the subroutine needs
  – A flag that indicates whether or not to ‘retry’
  – A flag that indicates whether or not to ‘wait’

• (Note: Newer kernels omit the ‘retry’ argument)

Working with LKM’s

• Create an LKM skeleton using ‘newproc’

• Compile a new LKM using ‘mmake’

• Install an LKM’s compiled ‘kernel object’ using the Linux ‘/sbin/insmod’ command

• Remove an LKM from the running kernel using the Linux ‘/sbin/rmmod’ command

‘iokludge.c’

module_init:
1) Allocate a kernel memory page, to be used as a new Interrupt Descriptor Table
2) Save original contents of system register IDTR, so it can be restored later
3) Prepare a memory-image for the new value of register IDTR, referring to kpage
4) Setup pointers ‘oldidt’ and ‘newidt’ and copy the original IDT to our new page
5) Setup a Gate-Descriptor, to be installed as Gate 9 in our new IDT array
6) Activate the new Interrupt Descriptor Table on all the processors in our system
7) Return 0, to indicate a successful module-installation

module_exit:
1) Restore the original value to register IDTR in each of our system’s processors
2) Free the page of kernel memory that was previously allocated for use as an IDT

‘tryiopl3.cpp’

• This demo-program is a modification of our earlier ‘seeflags.cpp’ example – but here we included the software interrupt instruction ‘int $9’ which, if ‘iokludge.ko’ has been installed, will allow us to check that indeed the RFLAGS register’s IOPL has been changed from 0 to 3 – thereby permitting ‘in’ and ‘out’ to be executed!

Homework exercise

• Modify the ‘82573pci.cpp’ program that we weren’t able to execute, even with ‘sudo’, at our previous class meeting, replacing its call to Linux’s ‘iopl()’ library-function by the ‘inline’ assembly language statement for software interrupt 9, i.e. asm(" int $9 ");

• Then try again to compile and execute our ‘82573pci.cpp’ demo-program, only this time with our ‘iokludge.ko’ LKM installed
