High Performance Kernel Mode Web Server for Windows

7/30/2019 High Performance Kernel Mode Web Server for Windows

1/57

High performance kernel mode web server for Windows

Degree Project D, 10 credits

Bo Branten

Department for Applied Physics and Electronics at Umea University

1st June 2005


2/57

Abstract

This degree project describes a method to increase the performanceof a web server by splitting it in two parts, one part executes in ker-nel mode and process requests for static web pages and picture fileswhile requests for dynamic web pages that is generated by executingscripts or contains information from an database is handed over to aweb server that executes in user mode. The web server that executesin kernel mode can be optimized for its purpose and minimizes processscheduling and copying of data which means that a larger number ofrequests per second can be processed. The web server that executes inuser mode can be an unmodified standard web server. In the projecta prototype is implemented for the Windows family of operating sys-tems, specific problems that is solved includes network communication

from kernel mode and how to hand over a request to user mode. Theprototype implementation is used to measure performance and evalu-ate how well the method works. The project is also related to earlierwork in Linux.

Sammanfattning

Det har examensarbetet beskriver en metod att hoja prestanda foren webbserver genom att dela upp den i tva delar, den ena delen exe-kverar i system-mode och behandlar anrop till statiska webbsidor ochbildfiler medan anrop till dynamiska webbsidor, dvs sadana som ge-

nereras genom exekvering av skript eller innehaller information somhamtas fran en databas, skickas vidare till en webbserver som exekverari anvandar-mode. Den webbserver som exekverar i system-mode kanoptimeras for sin tillampning och innebar att man minimerar process-schedulering och kopiering av data varfor ett storre antal anrop per se-kund kan behandlas. Den webbserver som exekverar i anvandar-modekan vara en omodifierad webbserver av standardtyp. I examensarbe-tet utvecklas en prototyp for Windows-familjen av operativsystem,speciella problem som loses ar natverkskommunikation fran system-mode och hur overlamning av anrop till anvandar-mode kan utforas.Prototyp-implementationen anvands for att gora prestanda matningaroch utvardera hur val metoden fungerar. Arbetet relateras ocksa tilltidigare arbeten i Linux.

ii


3/57

Contents

1 Introduction 1

1.1 About this report . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 About the author . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.3 Supervisors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.4 Typographical conventions . . . . . . . . . . . . . . . . . . . . 2

1.5 Hardware and software used . . . . . . . . . . . . . . . . . . . 3

1.6 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 4

2.1 The Mindcraft benchmark . . . . . . . . . . . . . . . . . . . . 4

2.2 Improvements to the network API . . . . . . . . . . . . . . . 4

2.2.1 sendfile() . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2.2 TCP CORK . . . . . . . . . . . . . . . . . . . . . . . 5

2.2.3 Zero-copy networking . . . . . . . . . . . . . . . . . . 6

2.2.4 Delayed accept . . . . . . . . . . . . . . . . . . . . . . 72.2.5 Send and disconnect . . . . . . . . . . . . . . . . . . . 7

2.3 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3.1 kHTTPd . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3.2 TUX . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 The Windows Networking Architecture 10

3.1 NDIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.2 Transport providers . . . . . . . . . . . . . . . . . . . . . . . 12

3.3 TDI clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.4 Windows Sockets . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Implementation 14

4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.1.1 Avoiding splitting sends . . . . . . . . . . . . . . . . . 14

iii


4/57

4.1.2 Zero-copy networking . . . . . . . . . . . . . . . . . . 15

4.1.3 Handing over a request to user mode . . . . . . . . . . 16

4.1.4 Replacing a system DLL . . . . . . . . . . . . . . . . . 18

4.1.5 Socket handles as file handles . . . . . . . . . . . . . . 19

4.2 Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2.1 Simple kernel mode sockets . . . . . . . . . . . . . . . 20

4.2.2 pthreads . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.2.3 Kernel mode web server . . . . . . . . . . . . . . . . . 30

5 Performance measurements 32

5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.1.1 What to measure . . . . . . . . . . . . . . . . . . . . . 32

5.1.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 33

5.1.3 Things to consider . . . . . . . . . . . . . . . . . . . . 35

5.1.4 Measurement setup . . . . . . . . . . . . . . . . . . . . 37

5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6 Conclusion 46

6.1 Difficulties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6.3 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6.4 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.5 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.6 Commercialisations . . . . . . . . . . . . . . . . . . . . . . . . 49

6.7 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

References 50

A Abbreviations 52

iv


5/57

1 Introduction

This section contains information about the report that is not directly re-lated to the topic itself. Some background to the degree project is given.Information on the author and supervisors is presented. Also some detailson the hardware and software used during the project are given. Finally theacknowledgments are presented.

1.1 About this report

This is a report on a degree project at the department for Applied Physics

and Electronics[1] at Umea University. Some of the areas the departmentfocus on is data communication and embedded systems, they have givencourses on development of device drivers for both Windows and Linux intheir four year education to a Degree of Master of Science with a major inApplied Electronics.

The intention of the degree project is to further develop the students abilityto gain deeper knowledge in a subject and apply it to a bigger project. Italso gives insights in research and development work and is an exercise indocumenting and presenting a project.

As a degree project I have chosen to investigate a method to improve the

performance of a web server and develop a prototype for Windows using thismethod. The project is called High performance kernel mode web serverfor Windows. First this report gives some background on the method andearlier work in Linux, this is followed by a description of the Windows net-working architecture. Then the prototype implementation is described. Thisprototype is used to evaluate how the method works in the Windows en-vironment and performance measurements is done. Finally the results areanalyzed and conclusions are drawn.

The selection of this degree project is based on personal interest and workexperience from device drivers in Windows, particular network communica-tion from kernel mode and file system drivers.

1.2 About the author

The author of this report Bo Branten is studying to a Degree of Master ofScience with a major in Applied Electronics at the department for AppliedPhysics and Electronics at Umea University. In addition to the studies he hasworked with Windows device drivers both as a teaching assistant and in theindustry, he is also internationally recognized in the open source community

1


6/57

for providing information on Windows file system drivers. Table 1 contains

information on how to contact him.

Name Bo BrantenE-mail [email protected]

Table 1: Author contact information

1.3 Supervisors

The supervisor for this degree project is Ulf Brydsten, Fil lic and the ex-

aminator is Dan Weinehall, MSc. Both are working at the department forApplied Physics and Electronics at Umea University. Their contact informa-tion is provided in table 2 and the address to the department is providedin table 3 .

Name Ulf Brydsten Dan WeinehallE-mail [email protected] [email protected]

Table 2: Supervisor contact information

Name Department for Applied Physics and Electronics

Umea UniversityAddress 901 87 UmeaSweden

Phone +46-(0)90-786 50 00Fax +46-(0)90-786 64 69

Table 3: Department contact information

1.4 Typographical conventions

In the report the following typographical conventions is used.

Roman regular Used for normal text.

Roman bold Used for description and table headings.

Constant width Used for source code, email addresses and URLs.

2


7/57

1.5 Hardware and software used

The prototype was implemented on a standard personal computer runningWindows. Table 4 lists the software used.

Windowsr 2000 Professional with Service Pack 4Visual Studio r 6.0 with Service Pack 5Windowsr 2000 Driver Development Kit (DDK)Debugging Tools for Windows r 6.4.7.2SoftICETM 2.5A collection of tools from OSR[2] and Sysinternals[3].

Table 4: Software used

For the evaluation of the prototype a test network was built, the equipmentused is described in section 5 on page 32.

The report was typeset by the author in LATEX.

1.6 Acknowledgments

I would like to thank the Department for Applied Physics and Electronicsfor creating a friendly environment for studies. Ulf Brydsten has been of

great support both during the degree project and in the earlier studies. AlsoI want to thank Anders Myrberg and the others at GTC Security AB[4] forgiven me the opportunity to learn more about the internals of Windows filesystem drivers and network communication from kernel mode.

Umea 1st June 2005 Bo Branten

3


8/57

2 Background

This section gives some background to the project, it describes earlier workin Linux and some methods to improve the network API.

2.1 The Mindcraft benchmark

In April 1999 a company called Mindcraft published a benchmark[5] com-paring Windows NT Server 4.0 and IIS with Red Hat Linux and Apache.The conclusion of the benchmark was:

Microsoft Windows NT Server 4.0 is 2.5 times faster than Linuxas a File Server and 3.7 times faster as a Web Server.

Mindcraft is a company which specializes in benchmarking systems for pay-ing customers, in this case the customer was Microsoft.

This benchmark became heavily debated, it was criticized by the Linuxcommunity for using an improper configuration of Linux and Apache for abig server under high load and exploiting weaknesses in them with the payingcustomer in mind. Even though Mindcraft has admitted that the benchmarkhad problems and has since published new results it showed that there was

a need to implement new system calls and new communication architecturesin the Linux kernel to support high performance web servers, it became thestarting point of a number of activities to respond to that.

2.2 Improvements to the network API

This section describes a number of improvements to the network API thatis related to the performance of web servers.

2.2.1 sendfile()

The traditional way for an application to send a file is to read parts of itinto a buffer and then write the buffer contents to a socket, this means thatthe data will be copied several times to and from the user address spaceand that several context switches is needed. To optimize this a new systemcall sendfile() was developed. It takes two file descriptors as parameters,one referring to a file to read data from and one referring to a socket thatthe data is written to, since this is done in the kernel with one systemcall the amount of data copying and context switches will be reduced. Thesendfile() system call was developed to improve the performance of user

4


9/57


10/57

acknowledge for the first packet, during this wait more characters could

be entered by the user and sent together as one packet or a packet couldarrive from the server, the buffered data can then be sent together withthe acknowledge for that packet. This algorithm will decrease the networkbandwidth used by a terminal application but for a web server that typicallysends a header followed by a file it would split the header and file data intwo packets and introduce a delay for the second packet.

Using the TCP CORK socket option is an important optimization for both usermode and kernel mode web servers.

On Windows there is no socket option that directly corresponds to TCP CORKhowever this is not a problem since the user mode function TransmitFile()

has parameters for an optional header and trailer to the file and the kernelmode TDI interface is using chained MDLs, therefore it is possible to puttogether the data that should be sent as one packet before handing it overto the TCP/IP-stack.

2.2.3 Zero-copy networking

As was described in an earlier section the sendfile() system call copiesdata from the file system cache to the socket buffer. New hardware featuresmakes it possible to eliminate even this final coping of data by implementing

something called zero-copy networking. To utilize zero-copy networkingthe network card must support two features in hardware:

Scatter/gather I/O.

Calculation ofTCP and IP checksums.

When sendfile() is used on a system with this kind of network card isit possible for the TCP/IP-stack to read data directly from the file sys-tem cache, calculate the checksum in hardware and send the packet on thenetwork without any extra copying being done.

An application or kernel mode code doesnt have to do anything special toutilize zero-copy networking, it is implemented by the TCP/IP-stack andthe driver for the network card.

On Windows the requirements to utilize zero-copy networking is the same, itwill be available if the network card and driver supports it. The kernel modeTDI interface is using chained MDLs and the file system interface supportsrequesting such chained MDLs representing data in the file system cache.In user mode is the function TransmitFile() available, it uses the justdescribed features in its implementation.

6


11/57

2.2.4 Delayed accept

A basic design of a web server is to call accept() on the main socket in aloop, when a new connection is established does accept() return a socketrepresenting it and the application creates a new thread or process to han-dle the request where after the loop calls accept() again waiting for moreconnections. The new thread or process will begin by calling recv() thatblocks waiting for data to arrive on the socket.

This process can be optimized by implementing something called delayedaccept. When a client connects to the server is the connection implicitlyaccepted by the TCP/IP-stack but the application isnt notified yet, in-stead the TCP/IP-stack waits until some amount of data has arrived. Theapplication will then be given both a socket representing the new connectionand a buffer containing the first data, for example a HTTP-request, withthe same system call. This will decrease the amount of scheduling needed.On Linux can delayed accept be requested by setting the socket optionTCP DEFER ACCEPT, Windows has instead implemented a special functionAcceptEx() that both do accept and returns the first data received whileFreeBSD even lets the user install an accept filter that can specify what datato receive before accepting an connection.

2.2.5 Send and disconnect

After the web server has sent all the data of a file it notifies the client ofthis by doing disconnect, the standard way to do this is calling shutdown()with the how parameter set to SD SEND followed by close() on the socket.However it is possible that the TCP/IP-stack would transfer the disconnectinformation in its own network packet, therefore it would be an advantage ifone could flag the data sent as the last data to go and have the TCP/IP-stack send the disconnect information together with it. Neither on Windowsnor on Linux does the TCP/IP-stack has an interface to do send and discon-nect as one integrated operation, however the function TransmitFile() onWindows takes a flag TF DISCONNECT that will disconnect the socket after

sending the file.

Something to think of when using the TDI interface on Windows is to notwait for the last (and for small files possible only) send to complete beforedoing disconnect because that means waiting for an acknowledge to arrivefrom the remote node on the data packet, this unnecessarily delays thedisconnect.

This may not sound like the most important optimization but in fact ithelps getting good results in certain benchmarks namely the number of

7


12/57

requests per second for small files using a serialized client. In this case the

largest amount of time is taken by connecting and disconnecting the TCPprotocol, after receiving the file data the client waits for more data to arriveor disconnect to happen before it sends a new request, a fast disconnect willimprove the benchmark result in this situation.

2.3 Previous work

By tradition software is divided into what is executed in user mode respectivewhat is executed in kernel mode and as much as possible is placed in usermode while kernel mode is only chosen when direct access to hardware is

needed or performance motivates it, for example by device drivers.The idea that the Linux developers got was to increase the performance ofa web server by splitting it in two parts, one part executes in kernel modeand process requests for static web pages and picture files while requestsfor dynamic web pages that is generated by executing scripts or containsinformation from an database is handed over to a web server that executesin user mode. The advantages of executing a web server in kernel mode isthat it decreases process scheduling, context switching and copying of data,also it makes it possible to integrate the web server with the TCP/IP-stackand not be limited to the standard socket interface. This makes it possible toprocess a larger number of requests per second and get higher throughput.

The web server that executes in user mode can be an unmodified standardweb server.

In the Linux community two projects has been created based on this method,kHTTPd by Arjan van de Ven and TUX by Ingo Molnar, they are describedin the following sections.

2.3.1 kHTTPd

kHTTPd[6] was the first kernel mode web server project for Linux, it wasstarted in 1999 by Arjan van de Ven as a response to the MindCraft bench-mark. kHTTPd was developed on the 2.2 series of the Linux kernel and wasintegrated in the main kernel distribution from version 2.3.14.

To begin with the kHTTPd server used a simple architecture, it started anew thread for every request and the communication that should be handedover to the user mode web server was copied from socket to socket. HoweverkHTTPd was soon developed to use an architecture that allows for higherperformance, it is shortly described in the following paragraph.

When the kHTTPd kernel module is loaded it creates one thread for each

8


13/57

CPU on the system. The processing of requests is divided in a number of

states with a corresponding queue to each state. The requests is transferredfrom one queue to the next when the state changes. The states are:

Wait for accept.

Wait for HTTP-header.

Send data or hand over to user mode web server.

Cleanup.

The first state wait for accept is queued by the operating system while the

other queues is handled by kHTTPd. After parsing the HTTP-header therequest is either put in the send data queue or handed over to the user modeweb server, kHTTPd selects between them depending on the URL and thefilename extension. If the request is to be handed over to the user mode webserver kHTTPd directly enters it in the accept queue for that socket, thiscan be done since a kernel module in Linux has access to data structures andfunctions internal to the kernel and the source code is available. The normalsetup is that kHTTPd is listening on port 80 while the user mode web serveris listening on some other port for example 8080. In the cleanup queue thesocket is closed and information about the request is logged. The requestswaiting in the queues is regularly checked using non-blocking system calls

to see if any input or output can be done.

The measurements published on the kHTTPd project homepage claims thatit was four to five times faster than Apache on the same system at that time.Later measurements[9] has showed that kHTTPd is faster or at least as fastas the best user mode web servers but that it is outperformed by newerkernel mode web servers using better architectures.

In the 2.5 series of development kernels kHTTPd was again removed from themain kernel distribution for a number of reasons among them the debate if aweb server belongs in the kernel and the availability of TUX, a new projectdescribed in the next section. Also kHTTPd had been un-maintained for a

while. Therefore it was decided that this kind of project is best maintainedseparately outside the main kernel distribution.

2.3.2 TUX

TUX[7][8] stands for Threaded linUX http layer and is designed and im-plemented by Ingo Molnar at Red Hat. Like kHTTPd it is also a kernelmode web server but it differs in design and features. TUX is available forboth the 2.4 and the 2.6 series of the Linux kernel and it is ported to several

9


14/57


15/57

from different vendors to work together and lets the user install new mod-

ules that provide added functionality. The following figure shows the mostimportant parts and how they are related.

LibraryNDIS

Application

WS2_32.DLL

MSAFD.DLL

Windows SocketsService Provider

WSHTCPIP.DLL

Windows Sockets Helperfor TCP/IP

Windows Sockets Helperfor other protocol

Transport provider forother protocol

TCPIP.SYS

NDIS.LIB

HAL.DLL

HAL

NDIS Intermediate driver

NDIS Miniport driver

AFD.SYS

for Windows Sockets

Application

Windows Sockets 1.1

Windows Sockets 2.0

WSOCK32.DLL

Ancillary Function Driver

Transport provider for TCP/IPIPX/SPX Apple Talk NetBEUI

RDR SRV NPFS

TDI

NDIS

TDI clients

Transport providers

User mode

Kernel mode

The Windows Networking Architecture

Starting from bottom is the hardware with one or more Network InterfaceCards (NICs). The software closest to the hardware on the Windows NTfamily of operating systems is the Hardware Abstraction Layer (HAL), itcontains for example functions to access I/O-ports and handle interrupts.The purpose of HAL is to isolate the rest of the operating system fromdifferences in the hardware among different computers within the same CPUarchitecture. The next layer above HAL is NDIS.

3.1 NDIS

The Network Driver Interface Specification (NDIS) was created to sepa-rate drivers for Network Interface Cards (NICs) from protocol implemen-tations. The driver for a NIC is called an NDIS miniport driver. The NDISminiport driver implements the part that is specific to a NIC and uses theNDIS support library for the processing that is common to all NIC drivers.

The NDIS support library envelopes an NDIS miniport driver both aboveand below, it contains functions to access the hardware, on the Windows

11


16/57

NT family of operating systems will the NDIS support library use HAL to

access the hardware. The interface to the transport providers above NDISalso goes through the NDIS support library, it is designed as a call backinterface where the NDIS miniport driver registers functions that the NDISsupport library will call when different events occur.

An NDIS miniport driver implements the interface specified by NDIS andis restricted to call a limited number of functions provided by the NDISsupport library, this makes NDIS miniport drivers binary compatible amongoperating systems that implement the NDIS support library. For exampleMicrosoft supports the use of the same NDIS miniport driver binary onWindows 9x and the Windows NT families of operating systems. It haseven been created modules[10][11] for Linux that allows loading an NDISminiport driver compiled for Windows in the Linux kernel and use it as anetwork driver for Linux.

There is also something called NDIS intermediate drivers that can be putabove NDIS miniport drivers but below the transport providers, they filterthe network traffic at a low level and can be used for example to implement afirewall, however it is also possible to install filter drivers above the transportprovider, so called TDI filter drivers, to filter the network traffic at that level.

3.2 Transport providers

The implementation of a network protocol is called a transport provider,the transport providers implement an interface called Transport DriverInterface (TDI) and uses the NDIS interface below them to talk to thenetwork hardware. A transport provider can dynamically be connected todifferent network cards or other communication channels like IrDA or a SAN,this is called binding and is done by the system administrator. Example oftransport providers included with Windows is TCP/IP, NetBEUI, IPX/SPXand AppleTalk.

3.3 TDI clients

The kernel mode interface to network protocols is TDI and the applicationsthat uses it is called TDI clients. TDI is the only network interface that isavailable to kernel mode code since Windows Sockets is only implemented foruser mode. TDI is an asynchronous event driven interface that is very generalso that it can be used for a wide variety of network types and protocols.Since the TDI interface is so general it has become abstract and is considereddifficult to work with. One of the most important TDI clients is the WindowsSockets implementation that is described in more detail in the next section.

12


17/57

Examples of other TDI clients are RDR and SRV the client respective the

server for the Windows network file system using the SMB protocol.

3.4 Windows Sockets

The most widely used network interface is called BSD sockets since it wasfirst implemented on BSD UNIX. The BSD sockets interface contains func-tions like socket(), connect(), send() and recv(). Microsoft has defineda network interface for Windows that is compatible with BSD sockets, itis called Windows Sockets. On older versions of Windows was WindowsSockets 1.1 used, it is a close implementation of BSD sockets with some

extensions. Windows Sockets 2 is however a more general interface that canbe extended to support other network protocols in addition to TCP/IP.

To provide a new protocol one implement something called a WindowsSockets Helper (WSH) DLL. When the WSH DLL has been registeredin the system can applications request the new protocol when they createa socket handle and then use the ordinary Windows Sockets function tocommunicate using this handle. Windows Sockets will provide the commonfunctionality and call functions in the WSH DLL for tasks that is specificto the new protocol. It is common that the new protocol also has its owntransport provider in kernel mode and that the WSH DLL communicateswith this. For example the TCP/IP transport provider is TCPIP.SYS and

its WSH DLL is WSHTCPIP.DLL.

The kernel mode part of the Windows sockets implementation AFD.SYSis also a file system driver, the reason for this is that it makes it possibleto let file objects and normal file handles represent sockets. Suppose thatan application creates a socket and makes a connection using the functionssocket() and connect(). Now if the application instead of using send()and recv() to transmit and receive data decides to use normal file I/Ofunctions like read() and write() or hand over the handle to an applicationthat dont know that the handle represents a socket. When the applicationcalls read() or write() will the I/O manager know what file system driver

the file object belongs to and call the read or write function that the filesystem driver has exported, since the file system driver for this file objectis the driver implementing sockets can it let its read and write functionstranslate to recv and send, the socket handle can thus be used exactly as anormal file handle.

More information on the Windows networking architecture can be found inthe MSDN Library[12] and the Windows DDK[13].

13


18/57

4 Implementation

TDI

NDIS

Kernel mode socketsKernel mode web server

User mode

Kernel mode

LibraryNDIS

User mode web server

Specialized WindowsSockets

Application

WS2_32.DLL

MSAFD.DLL

Windows SocketsService Provider

WSHTCPIP.DLL

Windows Sockets Helperfor TCP/IP

Windows Sockets Helperfor other protocol

Transport provider forother protocol

TCPIP.SYS

Transport provider for TCP/IP

NDIS.LIB

HAL.DLL

HAL

NDIS Intermediate driver

NDIS Miniport driver

AFD.SYS

for Windows Sockets

Application

Windows Sockets 1.1

Windows Sockets 2.0

WSOCK32.DLL

Transport providers

TDI clients

Ancillary Function Driver

Implementation

First this section describes the methods that where used to implement aprototype for Windows, following that are the components of the prototypedescribed in more detail.

4.1 Methods

This section describes the methods that where used to implement a proto-type for Windows.

4.1.1 Avoiding splitting sends

When a web server answers a HTTP GET request it first sends a header tothe client followed by the file data. If one simply calls the TCP/IP-stack to

send the header followed by one or more calls to send the file data it is likelythat the header will be sent in its own network packet before the next datais given to the TCP/IP-stack. For small files this will double the number ofnetwork packets. An important optimization for web servers is therefore tosend the header and the first file data in one packet.

On Linux this is accomplished with the TCP CORK socket option, and in usermode on Windows can the TransmitFile() function be used, it takes botha file handle and a header and trailer to send together with the file data.However in kernel mode on Windows is non of these possibilities available,

14


19/57

instead is the TDI interface used. The TDI interface is not using normal

pointers to memory areas to describe buffers but is instead using somethingcalled Memory Descriptor Lists (MDLs). An MDL is an internal struc-ture used by Windows to describe the physical pages that back a virtualmemory range. After allocating MDLs that represent the buffers containingthe header and file data they can be linked together using the Next field, anexample from the source code is showed here:

headerMdl->Next = fileMdl;

This is called an MDL-chain. When the TCP/IP-stack is given the MDL-chain it can program the network card to do DMA directly from the different

buffers that the MDL-chain describes.

4.1.2 Zero-copy networking

Eliminating the copy of data from the file system cache to the TCP/IP-stackwhen sending is called zero-copy networking. To truly get zero-copy network-ing the network card and its driver must support scatter/gather DMA andcalculation of TCP and IP checksums in hardware. The TDI interface isusing MDLs to describe the buffers to read data from and therefore mustone get an MDL that represents the data in the file system cache. Simplycreating a buffer and allocating an MDL for it would mean a copy opera-tion, however the file system interface as defined by Windows includes thepossibility to request an MDL that directly represents the physical pages inthe file system cache.

This is done by setting the minor function code in the I/O-stack locationto IRP MN MDL and the major function code to IRP MJ READ. When theread IRP completes is an MDL returned in Irp->MdlAddress. After useis this MDL returned to the file system driver using the minor function codeIRP MN COMPLETE MDL. The file system driver typically implements supportfor IRP MN MDL and IRP MN COMPLETE MDL with help of the cache managerfunctions CcMdlRead() and CcMdlReadComplete() . The fast-I/O interface

has the corresponding functions MdlRead() and MdlReadComplete().

Note that when not setting the minor function code to IRP MN MDL it is pos-sible to specify a pre-allocated MDL in Irp->MdlAddress that representsa buffer that the data will be read into, this is the normal way to do I/Ousing MDLs, however when specifying IRP MN MDL the Irp->MdlAddress isnot initiated before calling the file system but will instead on complete ofthe call be set to an MDL allocated by the file system driver and directlyrepresenting the data in the file system cache.

As a curiosity it could be mentioned that there is minor function codes and

15


20/57

fast-I/O functions to read compressed files in raw format, that is without

decompressing them, the HTTP protocol allows a web browser to notifythe web server that it supports different compression algorithms, if a webbrowser where to support the compression algorithm used by the NTFS filesystem this could be used to decrease the amount of data transferred onthe network without increasing the server load or storage requirement. Acommon way to implement support for compressed files in a web server isotherwise to store the files both un-compressed and compressed with oneor more compression algorithms using different file name extensions, theweb server can then choose the version of a file to send depending on whatcompression algorithms the client supports.

4.1.3 Handing over a request to user mode

Since the basic idea is to let a kernel mode web server handle some requestswhile others is handed over to a user mode web server an important questionis how the hand over is done, this section describes different methods to dothis.

HTTP redirect: The HTTP protocol allows a web server to send a redi-rect response to a request from a web browser, the browser then does a newrequest from another location, the redirect response could be used for the

requests that should be handed over to the user mode web server, for exam-ple by changing the port number. However this would double the number ofnetwork connections needed and increase the load on the server. Because ofthe low performance this method is not used in any existing project as faras known but an advantage is that it is easy to implement and it could beused in a prototype during experimentation.

Socket to socket copy: When the kernel mode web server finds that arequest should be handed over to the user mode web server it could simplycopy the data between the socket belonging to the remote client and a socket

belonging to the user mode web server, possible creating a new thread foreach connection or using non-blocking socket calls. This method was usedin the first versions of kHTTPd but to increase performance it was soonreplaced with the method described in the next paragraph. An advantageof this method is that it is rather simple to implement, it could be used forthe prototype on Windows but one would need to investigate the affects onperformance.

16


21/57

The Linux way: Newer versions of kHTTPd and TUX use a particular

efficient method to hand over a request to the user mode web server, theydirectly enters it in the accept queue for the socket it is listening on, this canbe done since a kernel module in Linux has access to data structures andfunctions internal to the operating system and the source code is available.When reading the HTTP request the kernel mode web server uses the flagMSG PEEK that returns the arrived data without removing it from the bufferin the TCP/IP-stack, if the request is to be handed over to the user modeweb server the data is still there to be read again. After the request is movedto the new socket is the original socket empty and the process waiting for aconnection on the new socket will be waken up as if the connection had beendone to that socket from the beginning. Of the methods described this one

has the highest performance since it avoids any extra copying of data andis transparent to the system, however it can only be done if the source codefor the operating system is available like it is for Linux. It is not possible tomodify the Windows Sockets implementation to use this method.

Replacing the original Windows Sockets implementation: The met-hod described in the last paragraph to hand over a request from the kernelmode web server to the user mode web server is the most efficient one, thereason it can not be used on Windows is that the source code to the Win-dows Sockets implementation is not available, however it is still possible to

solve the problem based on this method by replacing the Windows Socketsimplementation with one that is aware of the relation between the kernelmode web server and the user mode web server. This specialized WindowsSockets implementation would hand over a request to the user mode webserver if it can not be processed by the kernel mode web server. The usermode web server is like other network applications dynamically linked to aDLL that implements the Windows Sockets API, as is described in one of thefollowing sections it is possible to replace this DLL for only the user modeweb server while other network applications will continue to use the ordinaryimplementation. Advantages with this method is that it will probably be asefficient on Windows as on Linux and that the user mode web server can be

unmodified, an disadvantage is that it is a lot of work to implement it.

A new protocol family with a Windows Sockets Helper DLL: Win-dows Sockets version 2 is a general interface that can be extended to supportnew protocols. A new protocol is implemented in two parts, one kernel modetransport provider and one Windows Sockets Helper DLL. To use the newprotocol an application specify it as the parameter called protocol familyor address family to the socket() function. The common processing willbe done by the ordinary Windows Sockets implementation while it will call

17


22/57


23/57

SafeDllSearchMode to change the search order, if set to 1 the current di-

rectory is searched after the system and Windows directories, but beforethe directories in the PATH environment variable. On Windows XP SP1and Windows Server 2003 is this registry entry set to 1 as default.

If this specialized Windows Sockets only re-implement some of the functions,or only pre- or post-process them, it is convenient to be able to call theoriginal DLL, this can be done with help of the functions LoadLibrary()and GetProcAddress(). An example shows how this can be done for thefunction gethostname(), first is a function pointer and a variable to storea pointer to the original function declared:

typedef int (__stdcall *pf_gethostname)(char *name, int namelen);

static pf_gethostname win_gethostname;

At the initialization of the DLL is a pointer to the original function requestedusing the functions LoadLibrary() and GetProcAddress():

win_gethostname = (pf_gethostname)

GetProcAddress(LoadLibrary("c:\path\filename.dll"), "gethostname");

Then can a call to gethostname() easy be redirected to the same functionin the original DLL:

int __stdcall gethostname(char *name, int namelen)

{

return win_gethostname(name, namelen);

}

An important point to keep in mind when replacing a DLL is that an ap-plication can not only link to functions by name but also by number, thisnumber is called ordinal. To let a function have the same ordinal in thenew DLL as in the one being replaced one must create a Linker DefinitionFile, a part of it defining the ordinal for the function gethostname() isshown here:

EXPORTS

...

gethostname @57

4.1.5 Socket handles as file handles

The socket handles used by Windows Sockets is normal file system handles,this means that an application can call file I/O functions like read() and

19


24/57

write() with the socket handle and the I/O manager will redirect the call

to the driver that the corresponding file object belongs to. Since this is thedriver implementing the kernel mode part of Windows Sockets (AFD.SYS)it can translate the read or write call to recv and send respectively.

Windows provides functions to help the implementer of a new protocol to re-quest and manage this kind of handles, these functions are WPUCreateSocketHandle(), WPUQuerySocketHandleContext() and WPUCloseSocketHandle() .

It is also possible for a new protocol to implement this feature by acting asa full file system driver, however that is a large and difficult implementationto do.

4.2 Prototype

This section describes the components of the prototype.

4.2.1 Simple kernel mode sockets

A major part of the project was to implement the kernel mode sockets li-brary. This was difficult for two reasons, first it is more difficult to developand debug drivers than applications, especially on Windows and secondthe interface to network communication from kernel mode is in large part

undocumented. The TDI interface in itself is documented[13] but this doc-umentation is directed to implementers of new network protocols, there isalmost no documentation for developers of TDI clients and most notably isthe existing protocols like TCP/IP undocumented. During the work I hadgood help from postings on the Internet by Gary Nebbett and Maxim S.Shatskih.

The TDI interface is a standard driver interface based on IRPs. To use theTDI interface one opens a named device object representing the protocoloff choice. The open operation will create a file handle and a file object isrequested with ObReferenceObjectByHandle() . This file object can then be

used in further calls to the TDI interface to identify the socket. To makea call an IRP is created with TdiBuildInternalDeviceControlIrp() orIoAllocateIrp() and initialized with the help of one of the TdiBuildXxx()macros. The IRP is then sent to the transport driver using IoCallDriver()and it is possible to either wait on an event for it to complete or specify ancompletion routine that is called when the request completes.

The TDI interface uses two kinds of file objects; representing transport ad-dresses respective connection endpoints, a transport address is always neededwhile a connection endpoint is used for TCP sockets, a connection endpoint

20


25/57

is associated with a transport address and a transport address can be asso-

ciated with more than one connection endpoint.

It is also possible to register different event handlers on a transport address,when a network event occur is the corresponding event handler called, sincethe event handlers is called at DPC level what they are allowed to do islimited.

Following is comments on my experiences trying to implement a simplesocket library for kernel mode on Windows using the TDI interface. It isnot complete and focus on what is needed by the web server prototype butit includes some functionality not used by the web server prototype likeconnect() and UDP sockets.

Creating a socket:

int __cdecl socket(int af, int type, int protocol);

A socket is created with the function socket(), it takes parameters to spec-ify the kind of socket, the only kind of interest for a web server is streamsockets using the TCP protocol. The return value of the socket function isa handle that is used as a parameter to the functions operating on a socket.

There is no need to call the TDI interface at the time of creating a socket,

instead this function just allocates an internal data structure to keep infor-mation about the socket and returns a handle to it. To keep this implemen-tation simple I decided to use minus the address of the internal struct as ahandle to the socket, using the address of the struct as a handle makesit easily available and the reason for using minus the address is that sincea handle is 32-bit and addresses of memory allocated in kernel mode onWindows is above 2GB, minus the address will be a positive number whencasted to a signed integer while negative numbers can be used as error codes.Off cause this method should be replaced by using real handles, perhapsusing the RtlGenericTable functions or even using file system handles.

Binding the socket to a local address and port number:

int __cdecl bind(int socket, const struct sockaddr *addr,

int addrlen);

For a server the next step is usually to associate the socket with a local portnumber and possible an address, this is called binding.

The bind() function will call the TDI interface to do two things, createa transport address and possible set up one or more event handlers. The

21


26/57

file handle and file object representing the transport address is saved in the

internal socket struct for later use. The only event handler needed to be setin my implementation was for the disconnect event.

Making an connection:

int __cdecl connect(int socket, const struct sockaddr *addr,

int addrlen);

A web server does not need the connect() function but I implemented itanyway for completeness and it is shortly described here. If the socket is

not bound will the connect() function begin by doing an implicit bind, thismeans that the TCP/IP-stack will assign a free port number to it. Thenis something called a connection endpoint created, for information on whatthis is see the listen() function. Finally is a TDI CONNECT IRP sent to theTDI interface.

Listen for connections:

int __cdecl listen(int socket, int backlog);

The listen() function is used by a server to tell that a socket shouldlisten for connection attempts. A connection can then be accepted usingthe accept() function.

To understand how listen() and accept() relate to the TDI interface andimplementing them was the most difficult part of the socket library.

The TCP/IP-stack will listen for connection attempts either if a TDI LISTENcall is made or if a connect event handler is installed on the transport ad-dress. The TDI LISTEN call will complete when a connection attempt ismade, if the flags are 0 is the connection already accepted at that timeso no TDI ACCEPT call is needed, if the flags are TDI QUERY ACCEPT one

can either make a TDI ACCEPT call to accept the connection attempt ora TDI DISCONNECT call to deny it. The connect event handler builds an IRPthat will be sent to the TDI interface, it can either be a TDI ACCEPT IRP toaccept the connection attempt or a TDI DISCONNECT IRP to deny it.

In my simple implementation I did not use an connect event handler butinstead let the listen() function only create the connection endpoint, themajor part of the work is then done by the accept() function. This meansan deviation from the BSD socket standard since the socket wont be listeninguntil accept() is called but this wont affect the prototype developed in this

22


27/57

project. Another disadvantage with this method is that it corresponds to a

fixed backlog of only 1.

In a more elaborate implementation one would instead let listen() createa number of connection endpoints and install an connect event handler onthe transport address.

Accepting an connection:

int __cdecl accept(int socket, struct sockaddr *addr, int

*addrlen);

The accept() function accepts an connection and returns a new socketthat represents it while the old socket can be used again to accept moreconnections.

In my implementation does accept() start by issuing TDI LISTEN with flagsset to 0, this means to wait for a connection attempt and accepting it. Whenthe TDI interface is used in this way is TDI ACCEPT never needed. After theTDI LISTEN call completes does my implementation create and initiate anew socket, the connection endpoint is associated with this socket since itis now connected to the remote node. Finally is a new connection endpointcreated and associated with the old socket so it can be used to accept more

connections.

In the more elaborate implementation suggested in listen() above theaccept() function would instead wait for an event to be signalled by theconnect event handler that does the most part of the work. The connectevent handler would do two things, build an TDI ACCEPT IRP to accept theconnection attempt and create a new connection endpoint to replace theone being used so that the backlog still has the same size. The event thatthe accept() function is waiting on could be signalled by the completionfunction for the TDI ACCEPT IRP. Note that since the connect event handlerruns at DPC level it can not create a new connection endpoint from withinit but must instead schedule a work-item to do that.

However I stayed with the simpler implementation in this project.

Sending data:

int __cdecl send(int socket, const char *buf, int len, int

flags);

int __cdecl sendto(int socket, const char *buf, int len, int

flags, const struct sockaddr *addr, int addrlen);

23


28/57

The functions to send data can be a bit tricky to implement, for example the

TCP protocol provides a reliable byte stream and might need to re-transmita packet if it is lost on the network, there fore doesnt the TDI SEND requestcompletes as soon as the data has been scheduled to be sent, in the casea re-transmit is needed must the buffer remain available to the TCP/IP-stack, instead it will complete after an acknowledge has been received forall packets that were used to send the data in the buffer. However lettingsend() block until then would mean an inefficiency if the user does severalsmall sends after one another, therefore a natural implementation is to letsend() copy the data to an internal buffer and sending it from there insteadso that send() can return immediately, only when the internal buffer is fullwould send() block. Since the web server prototype doesnt use send() at

all but instead a new function specialized to send file data I didnt implementthis buffering in the normal send() but left it as it is. When sending UDPdata this shouldnt be an issue since UDP doesnt use re-transmit.

To send file data as efficiently as possible an MDL representing it in the filesystem cache is requested and the header and the first data is put togetherto a chained MDL. To send this I created a send function that takes anMDL instead of buffer and length parameters. The MDL is given to theTDI interface to be sent where after the function returns. A completionroutine specified by the caller is run when the call completes to de-allocatethe buffer and MDL.

A question that is raised since the send call is non-blocking is how manyTDI SEND requests to have outstanding, the documentation says that a sendrequest may return STATUS DEVICE NOT READY and that there is an eventhandler send possible to be called when the TDI interface is ready to sendagain. However for the file sizes used during performance measurements onthe web server I didnt had any problems with the send implementation soI left this question unanswered.

Receiving data:

int __cdecl recv(int socket, char *buf, int len, int flags);int __cdecl recvfrom(int socket, char *buf, int len, int flags,

struct sockaddr *addr, int *addrlen);

The functions to receive data was more straight forward to implement thanfor sending, in my implementation I just create an MDL for the users buffer,make a call to the TDI interface and wait for it to complete. I didnt findthe need to implement MDL versions of the receive functions for the webserver prototype but that could be done to make the socket library morecomplete.

24


29/57

The TDI interface allows registering an event handler that is called when

data is received, doing that wasnt needed in this project but in the con-clusions section I suggest that performance could be improved by analyzingand responding to a HTTP request from an receive event handler, howeverif sending file data from an event handler it must be cached in non-pagedmemory.

Getting and setting socket options:

int __cdecl getsockopt(int socket, int level, int optname, char

*optval, int *optlen);

int __cdecl setsockopt(int socket, int level, int optname, constchar *optval, int optlen);

The functions getsockopt() and setsockopt() is used to get and set op-tions on a socket, the prototype didnt need the use of any particular socketoption so I left them unimplemented.

Query the local or remote address and port number:

int __cdecl getsockname(int socket, struct sockaddr *addr, int

*addrlen);int __cdecl getpeername(int socket, struct sockaddr *addr, int

*addrlen);

The function getsockname() returns the address and port number of thelocal socket if it is bound. This information can be requested from the TDIinterface using the minor function code TDI QUERY INFORMATION and thequery type TDI QUERY ADDRESS INFO. When first trying this call I noticedthat the address returned was always 0.0.0.0 while the port number wascorrect. The documentation says:

TDI QUERY ADDRESS INFO specifies that the transport should re-turn the information, formatted as a TDI ADDRESS INFO struc-ture, in the client-supplied buffer mapped at MdlAddr. FileObjmust point to an open file object representing a transport addressor a connection endpoint already associated with a transport ad-dress.

Note that it doesnt say if it matters if a file object representing a trans-port address or a connection endpoint is used. After experiencing that the

25


30/57

address returned was always 0.0.0.0 I tried using the file object represent-

ing a connection endpoint instead of the transport address and indeed theaddress was correctly returned! This led to the following implementation ofthe function getsockname() for TCP sockets:

else if (s->type == SOCK_STREAM)

{

*addrlen = sizeof(struct sockaddr_in);

return tdi_query_address(

s->streamSocket && s->streamSocket->connectionFileObject ?

s->streamSocket->connectionFileObject : s->addressFileObject,

&localAddr->sin_addr.s_addr,&localAddr->sin_port

);

}

Note that if the user calls getsockname() after bind() but before connect()or listen() the socket has a port number but no connection endpoint,therefore one must query the transport address in this case.

The function getpeername() returns the address and port number of theremote node, this is always known and does not require a call to the TDIinterface, it is saved in the private socket struct either when connecting toa given address or when accepting an connection attempt at witch time itis provided by the TDI LISTEN call or the connect event handler.

Functions to convert between host and network byte order:

u_long __cdecl htonl(u_long hostlong);

u_short __cdecl htons(u_short hostshort);

u_long __cdecl ntohl(u_long netlong);

u_short __cdecl ntohs(u_short netshort);

Computers use either big endian or little endian byte order while Internetaddresses and port numbers always is transferred in big endian byte order.The functions htonl() and htons() is used to convert between the hostsbyte order and network byte order for long and short integers respectively,the functions ntohl() and ntohs() is used to transfer between network byteorder and the hosts byte order but off cause they do the same thing as theformer functions. For a big endian system this functions do nothing but sincethe x86 CPU is using little endian byte order they should be implementedto swap the bytes, this can be done as a pre-processor macro, anyway theTDI interface is not involved.

26


31/57

Functions to convert between ASCII and integer representation

of addresses:

u_long __cdecl inet_addr(const char *name);

int __cdecl inet_aton(const char *name, struct in_addr *addr);

char * __cdecl inet_ntoa(struct in_addr addr);

These functions is used to convert between ASCII and integer representa-tion of addresses, like the byte order functions they are straight forwardto implement and does not require the use of the TDI interface, howeverone issue to mention is that inet ntoa() returns a pointer to an string, ifthis is a statically allocated string witch is common in user mode will the

implementation not be thread safe, on the other hand if it is dynamicallyallocated like I choose to do it must be de-allocated by the caller after usewitch is an incompatibility from the BSD socket standard.

Shutting down a socket:

int __cdecl shutdown(int socket, int how);

I implemented the function shutdown() to call TDI DISCONNECT with theflags set to TDI DISCONNECT RELEASE, for more information on this request

see close().

Closing a socket:

int __cdecl close(int socket);

It is almost optional to use event handlers but in fact for TCP sockets thereis one event handler that must be used and that is the disconnect event. Toproperly close a TCP connection and be sure that all data sent is transferredto the remote node one must begin by calling TDI DISCONNECT with the flags

set to TDI DISCONNECT RELEASE, however waiting for this IRP to completeand then closing the file objects representing the transport address and theconnection endpoint might close them prematurely and data is lost, insteadone must wait for the disconnect event to be called witch it is after the TCPprotocol has negotiated the closing of the connection.

Note that the flag TDI DISCONNECT RELEASE does a graceful close but italso exists a flag TDI DISCONNECT ABORT to do a hard close of the socket, inthis case will the disconnect event handler not be called. However when doinga graceful close but the remote node does a hard close will the disconnect

27


32/57

event handler be called, the flags parameter to the disconnect event handler

tells what kind of close the remote node did.

In my implementation I choose to install the disconnect event handler whencreating the transport address in bind() and send TDI DISCONNECT from ei-ther shutdown() or close() and then finally wait for an event to be signalledfrom the disconnect event handler. After this wait is the file objects repre-senting the transport address and the connection endpoint de-referenced andtheir handles closed. The TDI DISCONNECT ABORT flag was not needed.

DNS lookup functions:

struct hostent * __cdecl gethostbyaddr(const char *addr, int

addrlen, int type);

struct hostent * __cdecl gethostbyname(const char *name);

int __cdecl gethostname(char *name, int namelen);

struct protoent * __cdecl getprotobyname(const char *name);

struct protoent * __cdecl getprotobynumber(int number);

struct servent * __cdecl getservbyname(const char *name, const

char *proto);

struct servent * __cdecl getservbyport(int port, const char

*proto);

This collection of functions is used to query the DNS, for example to get theInternet address of a named host. Since this is not needed by the web serverprototype all of these functions is un-implemented however I will commenton how to implement them.

A straight forward method is to just implement the functions using thesocket library, the DNS query protocol is simple and there should be noproblems implementing it for kernel mode on Windows. Another possibilityis to implement them with the help of a user mode service process thatthe kernel mode part communicates with. The user mode process can thenforward the call to the ordinary Windows Sockets API and finally return

the result to the kernel mode sockets library. An advantage of this methodis that if the ordinary Windows implementation is changed or re-configuredwill the kernel mode sockets library automatically make use of this.

The kernel mode socket implementation compiles to a static library that adriver can link to.

28


33/57

4.2.2 pthreads

pthreads or the POSIX Threads API is a standard that defines how tocreate multiple threads in a process and how to synchronize and communi-cate between them. To support developing the prototype I implemented asimple pthread library for Windows drivers limited to the functions used bythe prototype. It had off cause been possible to let the prototype use theWindows API directly but an advantage with this method is that it makesit easy to borrow code between Windows and UNIX and user mode and ker-nel mode while developing the prototype and experimenting with differentsolutions.

Only the functions to create a thread, terminate a thread and wait for athread to terminate was needed. Following is comments on their implemen-tation.

Create a thread:

int __cdecl pthread_create(pthread_t *thread, const pthread_attr_t

*attr, void *(*start_routine)(void*), void *arg);

The function pthread create() creates a new thread using PsCreateSystemThread() and optionally saves a handle to it if the parameter pthread t*thread is non-NULL. If a handle to the thread is saved it must be waitedon with the function pthread join() or closed with ZwClose().

Terminate a thread:

void __cdecl pthread_exit(void *value_ptr);

The function pthread exit() terminates the current thread by callingPsTerminateSystemThread() . It is also possible for a thread to terminateby simply returning from its main function.

Wait for a thread to terminate:

int __cdecl pthread_join(pthread_t thread, void **value_ptr);

The function pthread join() waits for a given thread to terminate us-ing ZwWaitForSingleObject() , if the threads exit status is requested it isqueried using ZwQueryInformationThread() . If one has saved an handle toa thread it must be waited on with this function or closed with ZwClose().

29


34/57

The pthread implementation compiles to a static library that a driver can

link to.

4.2.3 Kernel mode web server

The kernel mode web server developed is a prototype intended to be usedfor performance measurements and evaluation of different methods thereforeit only implements the HTTP GET request. It uses the socket and pthreadlibrary described above. Following is a description of its implementation.

Web server: The main part of the web server is simple, the DriverEntry()

function creates a socket and a thread that waits for connections to it. Forevery connection is separate thread created to handle the request, the threadwill read the HTTP request and if it is a HTTP GET request and the re-quested file exists will it be sent using send file mdl(). Other HTTP re-quests is not supported, there is also limited support for the different HTTPheaders that can help web browsers and proxy servers to be more efficient.

File-I/O: The most interesting part of the web server prototype is thefile-I/O functions since they use the MDL interface to utilize zero-copy net-working.

HANDLE dopen(PUNICODE_STRING dirName);

The function dopen() uses ZwCreateFile() to open a handle to a directory.

HANDLE fopen(char *fileName, HANDLE rootDir);

The function fopen() uses ZwCreateFile() to open a handle to a file. Anice feature of ZwCreateFile() that is used here is the possibility to do aso called relative open, this means that a file handle to an already opened

directory is given, the path and file name is then interpreted to be relativeto this directory. This is used by the web server to open the requested filerelative to the directory that has been specified as the document root.

Another thing worth to mention here is that I specify the flag FILE SEQUENTIALONLY to inform the cache manager that the file will be read sequentially, fromthe beginning to the end, the cache manager can then use this informationto optimize the caching of the file.

int fread(HANDLE fileHandle, void *buf, int len);

30


35/57

The function fread() reads file data with ZwReadFile() to a buffer, it is

included here only for demonstration and comparison since the web serverinstead should use fread mdl().

int fread_mdl(PFILE_OBJECT fileObject, PLARGE_INTEGER offset,

ULONG len, PMDL *mdl);

The fread mdl() function uses the special MDL interface to the file systemto request an MDL representing the data in the file system cache. It allocatesat initializes a normal read IRP but sets the MinorFunction to IRP MN MDLand leave the Irp->MdlAddress un-initialized. When the call completes has

the file system driver set Irp->

MdlAddress to an MDL that is returned tothe caller.

int fread_mdl_complete(PFILE_OBJECT fileObject, PMDL mdl);

The function fread mdl complete() is used to return (de-allocate) an MDLrequested with fread mdl(), it sets the MinorFunction to IRP MN COMPLETEMDL and Irp->MdlAddress to the MDL that should be returned to the filesystem.

void fclose(HANDLE fileHandle);

The function fclose() uses ZwClose() to close the file handle.

int send_file(int sock, char *header, HANDLE fileHandle);

The function send file() uses fread() and send() to send a header fol-lowed by file data to a socket. It is only included for demonstration andcomparison with send file mdl().

int send_file_mdl(int sock, char *header, HANDLE fileHandle);

The function send file mdl() uses fread mdl() to request an MDL repre-senting the file data, for the first block it links the file data MDL to an MDLfor the header. Then the MDL is sent and an completion routine specified tode-allocate it with fread mdl complete(). This is repeated in a loop untilall of the file data has been sent.

static void send_mdl_complete(int status, void *context);

31


36/57

The function send mdl complete() is run at DPC level when the send re-

quest has completed and the TDI interface doesnt need the buffer any more,it queues a work item to do the de-allocation since that is not allowed atDPC level.

static void work_item(void *context);

The function work item() is finally run and uses fread mdl complete() toreturn the MDL to the file system and de-allocates other resources used.

5 Performance measurements

This section describes how to measure performance of web servers and theresults from measuring the prototype and other web servers for Windowsand Linux.

5.1 Method

5.1.1 What to measure

The most common parameters to measure is latency (time to complete arequest) and throughput (bytes/second), it is also common to study theweb servers performance for different file sizes, for small files is the mosttime spent on setting up the TCP connection and process the request so thenumber of requests per second gives good information on the performance,for bigger files is the performance more affected on the work of reading filedata from the file system and sending it to the TCP/IP-stack so throughputwill better show the difference between web serves here.

When the Standard Performance Evaluation Corporation (SPEC) developedtheir SpecWeb benchmark they analyzed log files from different web serversto determine the file size distribution of the requests handled by a typical

web server. The result is listed in table 5 and shows that rather small filesis most common.

Files less than 1KB 35%Files between 1KB and 10KB 50%Files between 10KB and 100KB 14%Files between 100KB and 1000KB 1%

Table 5: File size distribution of requests to a typical web server

32


37/57

Another thing to measure is how the web server handles a large number of

simultaneous connections, this will show if the connections is handled in anefficient way or if more and more processing time goes to just administratingthe connections. One should not forget that even idle connections couldimpose a load on the server.

Related to this is testing the web servers overload behaviour, this meansto connect a larger number of clients making more requests than the webserver can handle. Depending on the design of the web server can two thingshappen, ideally will many of the clients be processed rather fast while othersget connection timeout or connection refused, this means that the server willmake progress and that the load will eventually decrease again. On the otherhand it could also happen that the new requests fill up the servers resourcesand make the processing slower and slower for all clients until the serverappears to have locked up, this condition is called receive livelock[14].

5.1.2 Benchmarks

This section describes the most common benchmarks for web servers.

httperf: The httperf benchmark[15] is developed by David Mosberger atHewlett-Packard Research Labs. The httperf benchmark is a flexible HTTP

client that requests a file from a web server a number of times and possiblefrom a number of parallel threads and then prints out detailed statistics.The most important statistics is the number of requests per second andthe network throughput but httperf also gives timing information with minand max limits and information on the errors if any. An example of theoutput from httperf is shown below. Httperf does not use any particular fileset, it is left to the tester to choose them. A strength of httperf is that itcan measure how the web server can handle a load with a large number ofconcurrent connections and records any timeouts or other errors that occur.The httperf benchmark is available for free in source code form and can becompiled for most versions of UNIX.

$ httperf --timeout=60 --server=192.168.1.1 --port=80 --uri=/www/64.jpg

--num-conns=1000

Maximum connect burst length: 1

Total: connections 1000 requests 1000 replies 1000 test-duration 0.527 s

Connection rate: 1897.3 conn/s (0.5 ms/conn,


38/57


39/57

one buys a license. The retail version of SpecWeb99 cost $800 and a license

for non-profit or educational use cost $200.

5.1.3 Things to consider

This section discusses things to consider when measuring web server perfor-mance.

The network speed: To get a valid measurement result it is important toavoid that the network speed is a bottleneck but that instead it is the servershardware and software that sets the limit. It is mostly when measuring

throughput on larger files that this may become an issue, if the networkspeed is to low the graphs for the different web servers will come closerto each other and approach the theoretical maximum. My measurementsshowed that even if the pattern was the same the spread among the webservers was bigger on 100Mb/s Ethernet than on 10Mb/s Ethernet.

The presence of an anti-virus program: At the first measurements Ireceived rather low performance on Windows compared to Linux, thereforeI investigated the system to see if there was any difference in the setup thatcould cause problems on Windows. Among other things I checked that the

network card was fully supported by the device drivers on both Linux andWindows and that it was run in the same mode; duplex, DMA and calcula-tion of TCP checksums in hardware. After some investigation I noticed thatNorton Antivirus was installed and running on the Windows system, dis-abling it increased the performance with 20% to 25%, a significant change.The reason for this is that Norton like other antivirus programs installs afile system filter driver that hooks into the file system operations, when aread request is done by an application the file system filter will scan thefile data for viruses before returning the data to the application, this willintroduce an performance degradation. However there have existed antivirusprograms that instead scanned the file content on every open request, they

reduced the performance a lot more! To do a fair comparison between Linuxand Windows it is important to check for these kinds of differences.

The maximum number of open file descriptors: To benchmark theweb servers I used httperf and run it on Linux. A limitation that I noticedduring testing with a large numbers of simultaneous connections was thenumber of open file descriptors allowed, a socket handle is a file descriptor.The maximum number of open file descriptors has changed during the ver-sions of the Linux kernel, on the system I used it was 8192, the following

35


40/57

command shows how to tell the limit:

$ cat /proc/sys/fs/file-max

8192

The following command run as root increases the limit to 65535:

# echo "65535" > /proc/sys/fs/file-max

Unfortunately the maximum number of open file descriptors is also compiledinto programs using the select() system call like httperf with a limit of

only 1024. To correct this one must edit the line

#define __FD_SETSIZE 1024

in the files /usr/include/bits/types.h and /usr/include/linux/posix types.hto the new value and recompile the program.

Depending on the system configuration it might also be needed to increasethe per user limit of open file descriptors, for sh this is done with the com-mand:

# ulimit -n number

and for csh with the command:

# limit descriptors number

A related problem is that if a program doesnt specify an local address andport number when calling bind() is the port number allocated from a limitedrange, by default the range is 1024 to 4999, allowing 3975 simultaneousoutgoing connections. This range can be viewed with the command:

$ cat /proc/sys/net/ipv4/ip_local_port_range

1024 4999

The following command run as root increases the range up to 65535:

# echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range

36


41/57

The TIME WAIT state: When a TCP connection is closed does the

protocol go into a state called TIME WAIT that lasts for 60 seconds ortwice the maximum segment lifetime. One of the reasons for this is to avoidthat duplicate packets who has not arrived yet will be mistaken for valid dataon a new connection using the same port. However since the port numberis a 16-bit integer there can only be at most 65536 connections at the sametime and a port will be unavailable for a minute after closing the connection.This may become a bottleneck both on the client side where the benchmarkprogram is run and on the server side when measuring a high connectionrate for a long time, one workaround is to only make short measurementsfor small files that will give the highest connection rate, it is also possibleto avoid or shorten the TIME WAIT state but that means introducing an

incompatibility with the TCP/IP-protocol.

The maximum listen backlog: The function listen() takes a param-eter called backlog that specifies how many connections can be pendingwaiting for accept(). This is the number of connections that may arrivewhile the server processes an accepted request, possible by creating a newthread to handle the client, before it calls accept() again. If more clientsthan the backlog number try to connect during this time they will get con-nection refused. The web server can specify a backlog number when it callslisten() but the operating system usually has some limit and will not

use a backlog number above this. The maximum listen backlog for differentenvironments is listed in table 6 .

Windows Sockets on client versions of Windows 5Windows Sockets on server versions of Windows 200TDI in kernel mode on Windows Decided by

programmerLinux 128

Table 6: Maximum listen backlog

5.1.4 Measurement setup

The measurements were done in a lab at the department, the computersare equipped with AMD Athlon 1133MHz CPU and 128MB RAM. Theyare configured for dual boot and can either run Windows 2000 Professionalor Gentoo Linux[18]. When doing the measurements the computers whereconnected to a separate Ethernet switch. As benchmark program httperfwhere used because of its flexibility, it was run on Linux while the servercomputer either ran Linux or Windows depending on witch web server to

37


42/57

measure.

5.2 Results

This section shows diagrams with the results from the measurements, thecomments are in the next section. Measurement on 10Mb/s Ethernet:

Number of requests per second

Throughput in kB per second

38


43/57

Measurement on 10Mb/s Ethernet, comparison:



39


44/57

Measurement on 100Mb/s Ethernet:



40


45/57




41


46/57

Measurement on 1000Mb/s Ethernet:



42


47/57




43


48/57

The effect of an anti-virus filter:



44


49/57

The effect of an anti-virus filter, comparison:



45


50/57

6 Conclusion

This section describes the conclusions drawn from the project and suggestsfuture work.

6.1 Difficulties

The main difficulty with the project was to implement kernel mode codefor Windows. Developing device drivers is more challenging than writinguser mode code both on Windows and on Linux, among other reasons isthere less support from the debugging tools and a small error in a driver

can make the whole operating system unstable or crash. Such errors areoften difficult to debug. Another difficulty is that on Windows is the kernelmode interface to network communication in large part undocumented. TheTDI interface in itself is documented but this documentation is directed toimplementers of new network protocols, there is almost no documentationfor developers of TDI clients and most notably is the existing protocols likeTCP/IP undocumented. Because of this a major part of the project wasdeveloping the kernel mode sockets library.

6.2 Performance

The measurements showed that kernel mode web servers has a substantialperformance gain over user mode web servers both on Linux and on Win-dows.

The factor that is improved most in kernel mode is the latency witch meansthat a larger number of requests per second can be processed, this is mostvisible for small files where the kernel mode web servers on Linux is 50% to75% faster than user mode web servers while the prototype on Windows wasas much as 200% to 300% faster than user mode web servers on the sameplatform, for bigger files decreases the difference to only 0% to 25%. Thereason for the big improvement in latency is probably that a kernel mode

web server avoids unnecessary process scheduling.

The throughput is also improved by the kernel mode web servers but thedifference decreases as bigger the files is. For small and medium sized fileshas the kernel mode web servers on Linux 25% to 50% higher throughputand the prototype on Windows 100% to 200% higher throughput comparedto user mode web servers on the same platform, for bigger files decreasesthe difference to 0% to 25%. The main reason for the higher throughputfrom the kernel mode web servers is probably that they avoid unnecessarycopying of data in addition to the fact that lower latency also helps getting

46


51/57

higher throughput.

When comparing absolute performance between the two platforms one makesthe observation that user mode web servers is faster on Linux than on Win-dows, all kernel mode web servers is faster than the user mode web serversand the kernel mode web servers on Linux is slightly faster than the proto-type kernel mode web server developed for Windows.

I also tried to do measurements on 1000Mb/s Ethernet but the Linux driverfor the network card was a beta version and produced unpredictable results,however the diagrams is included and the measurements on Windows is validand show the same pattern that the measurements on 10Mb/s Ethernet and100Mb/s Ethernet.

Also for informational purposes I included the results that showed that ananti-virus filter may decrease the performance as much as 25% for small fileswhile the effect is smaller for bigger files. Even though the anti-virus filterscans a file for viruses on read it mainly affects the latency, the reason forthis is probably that the scan is scheduled to a system thread while the scanitself is a fast operation.

The conclusion of the measurements is that kernel mode web servers shouldbe worth considering from a performance viewpoint, that the performancegain from a kernel mode web server is bigger on Windows than on Linuxwhen compared to user mode web servers on the same platform and that

the prototype performs well compared to the kernel mode web servers onLinux but that there is room for further optimizations.

6.3 Stability

By tradition as much software as possible is placed in user mode while kernelmode is only chosen when direct access to hardware is needed or performancemotivates it, for example by device drivers. The most important reason forthis is stability, programs that execute in user mode is protected from eachother, every process has its own virtual memory context, this means that

an error in one process, like using an invalid pointer to write to the wrongmemory location, can not affect other processes instead will the process inquestion be terminated. Software that execute in kernel mode on the otherhand has full access to the operating systems internal data structures andthe hardware, an error in this code usually brings down the whole system orcreates data corruption. Therefore it has been debated if a web server reallybelongs in kernel mode even if it means some performance gain, howeverthere is a trend to move more and more software to kernel mode, for exampledoes both Windows and Linux runs its network file system servers in kernelmode. A web server can be compared to a simple file server, relative other

47


52/57

file servers is it small and not very complex so the arguments to not run a

web server in kernel mode is weak considering todays trends.

6.4 Usability

Another thing to consider when deciding if a kernel mode web server is worththe performance gain is usability. A web server needs a certain amount ofadministration, for example it need to be configured, also it will produce logfiles that should be studied and archived. Adding a kernel mode web serverthat handles some requests and hands over others to the user mode webserver incr

Documents

High Performance Kernel Mode Web Server for Windows