
Evaluating Memory Sharing Data Size and TCP Connections in the Performance of a Reconfigurable Hardware-based Architecture for TCP/IP Stack

Jean Carlo Hamerski, Everton Reckziegel, Fernanda Lima Kastensmidt

Universidade Federal do Rio Grande do Sul
Instituto de Informática – PPGC
Porto Alegre - Brazil
{jchamerski, ereckziegel, fglima}@inf.ufrgs.br

Abstract – Software-based processing of the TCP/IP (Transmission Control Protocol/Internet Protocol) stack has become a bottleneck in the face of the explosive growth of data transmission rates on the Internet. Software-based TCP/IP cannot process packets at the rate of the transmission lines, which has been pushing TCP/IP processing into hardware. The use of dedicated hardware for TCP/IP stack processing aims to reduce the Central Processing Unit (CPU) load and to increase as much as possible the throughput of Internet services that need a large bandwidth. In this context, a reconfigurable hardware-based architecture for processing the transport- and network-layer protocols is proposed. The effects of the shared-memory data size and of the number of TCP connections are evaluated in terms of area and packet-processing performance.

I. INTRODUCTION

Due to the advances in semiconductor technology, Ethernet transmission rates have grown from 3 Mbps to 10 Gbps, and processor speed has grown at a comparable pace. However, the capacity of software-based TCP/IP stack processing has not increased at the same rate [1]. Several studies [2, 3] indicate that the largest overhead of TCP/IP processing comes from I/O management and operating-system buffer handling. When a high transmission rate, and hence a high processing rate, is required, these I/O tasks can consume a large share of CPU time just to process TCP/IP packets. Consequently, other applications running on the same platform may be left without enough CPU time to execute their own tasks.

By implementing the hardware-based TCP/IP stack in high-speed computing environments, administrators can help relieve network bottlenecks and improve application performance [17].

In this paper, we propose a dedicated hardware architecture for TCP/IP stack processing on reconfigurable devices. A set of implementation versions based on different shared-memory data sizes and numbers of TCP connections was synthesized onto a Xilinx Virtex-II Pro FPGA. The results were analyzed in terms of area and performance, evaluating the influence of the data size and of the number of TCP connections. Moreover, a network-traffic functional simulation was developed for both implementations in order to analyze performance in terms of packet computing time.

The proposed architecture aims to provide a friendly interface between the TCP/IP processing in hardware and the application-layer processing in software. The goal of this approach is to overcome the limitations of traditional TCP/IP processing in software and to minimize the system resources needed for this processing.

The paper is organized as follows. Section II presents related works that aim to accelerate TCP/IP stack processing in software and in hardware, and how this work fits into that research. Section III introduces the TCP/IP model used in the implementation of the proposed architecture. Section IV presents the proposed architecture, its implementations, and the network-flow simulation environment used for testing purposes. Section V shows a preliminary communication interface between the TCP/IP Module in hardware and the application layer in software. Finally, Section VI presents the conclusions.

II. RELATED WORKS

The works that aim to reduce the overhead of TCP/IP stack processing can be divided into two groups: software solutions, which identify the processing bottlenecks that can be optimized, such as kernel operations and device drivers; and hardware solutions, which exploit the speed of dedicated hardware to remove these communication bottlenecks.

The main bottlenecks of software processing are interrupt handling, data copying, and similar operations. Several works try to address these issues. In [4], the bottleneck is identified as the memory traffic between the user space and the kernel space, in addition to the traffic between the kernel space and the buffers of the network devices. The goal there is to develop a network-device architecture, together with a communication interface to the operating-system kernel,


where the checksum and packet-storage tasks are performed in dedicated hardware.

A software solution proposed in [5] presents an optimized algorithm for TCP packet processing. The most frequently used method, however, is to reduce the overhead by computing packet checksums in dedicated hardware. In [6], the proposed technique combines the copy and checksum operations to reduce the checksum computing cost.

Another proposed technique is a "zero-pass checksum" [7], where packets are moved straight from system memory to the network devices. Yet another scheme implements the Internet checksum directly in the network devices [8].

A TCP/IP Core for reconfigurable logic presented in [9] is able to process TCP/IP stack packets fully in dedicated hardware. Other works [10, 11] also follow the same idea.

The referenced works use several techniques to decrease the processing overhead. However, they are not parameterized for different application requirements. The proposed architecture can be customized in the number of TCP connections, the ARP (Address Resolution Protocol) table size, and the shared memory size. In addition, we implemented two versions of the TCP/IP stack using two word sizes for the shared memory in order to analyze their influence on the performance of the protocol processing. Beyond this, the architecture provides an interface that makes it possible to implement services that use the stack for application-to-application communication.

A comparison between the proposed TCP/IP Module and related works will be carried out in future work, after its validation in a real network environment.

III. INTERNET COMMUNICATION MODEL

Before explaining the architecture details, it is important to review the models that define how communication takes place on the Internet.

A standard model that makes it possible to understand and design network architectures is the OSI (Open Systems Interconnection) model. This model defines a theoretical standard based on seven layers (Fig. 1a) and has served mainly as a reference for understanding the communication processes of computer networks.

A given layer in the OSI model generally communicates with three other layers: the layer directly above it, the layer directly below it, and its peer layer in other networked computer systems. For any information that is sent, a header is inserted at each layer transition so that the peer layer can interpret the data and know what to do with them. On the receiving side, after processing, these headers are discarded and the packet is passed to the layer above. When the packet arrives at the receiver's application layer, it is again meaningful data. The application layer handles messages; the transport layer, segments; the network layer, datagrams; the data link layer, frames; and the physical layer, bits.

The TCP/IP model (Fig. 1b) is a hybrid model derived from the OSI model. It merges the three upper layers into a single application layer. Fig. 1 shows the difference between the models.

Thus, the proposed architecture is based on the four layers of the TCP/IP model. The physical layer is implemented by the one already existing in a network device, for example Ethernet. The network and transport layers are implemented by the TCP/IP Module in hardware. The software platform is responsible for executing the tasks of the application layer.

IV. TCP/IP MODULE IN HARDWARE

A. Architecture

The preliminary architecture proposed in this paper is based on the idea of a storage space shared among the network interface, the TCP/IP hardware module, and the application running on a software platform. Fig. 2 shows the differences between the traditional and the proposed approaches.

Figure 2 – (a) Software TCP/IP and (b) hardware architecture

The network interface is responsible for assembling the packets that come from the physical layer and storing them in the shared memory. An important observation is that the TCP/IP hardware module is independent of the network interface being used, e.g., Ethernet, wireless, etc.

The TCP/IP Module implemented in hardware is responsible for processing the IP (Internet Protocol), ICMP (Internet Control Message Protocol), ARP (Address Resolution Protocol), and TCP (Transmission Control Protocol) protocols. Tasks such as connection establishment, data reception and transmission, checksum computation, and connection management are done by the TCP/IP Module.

Figure 1 – OSI Model and TCP/IP Model: (a) the seven OSI layers (Application, Presentation, Session, Transport, Network, Data Link, Physical) and (b) the four TCP/IP layers (Application, Transport, Network, Physical)


The use of dedicated hardware for this processing aims to increase the throughput and to reduce the CPU load, compared with the approach in which all TCP/IP processing is performed in software [17].

In the hardware implementation, the entire management of the Internet data stream is done by the TCP/IP Module, while higher-level operations, more precisely the application-layer services, are performed in software.

The modular composition of the TCP/IP protocol stack allowed the proposed architecture to be divided into modules according to protocol type. Thus, it can be configured according to the service that is intended to run on top of the TCP/IP stack. For example, services that do not need the functions implemented by the ICMP protocol can dispense with the corresponding module, so the designer can simply remove the ICMP module to tailor the architecture to the specific application.

Internally, the TCP/IP Module is composed of the sub-modules shown in Fig. 3. The Main Control Module (MCM) is responsible for the coordination and overall operation of the TCP/IP Module. It performs the first data checks, such as the IP header checksum and the protocol version, and, if the packet is valid, it identifies the protocol type and forwards the packet to the corresponding sub-module (otherwise, it discards the packet).

The other modules (ICMP, ARP, and TCP) are responsible for processing their respective protocols. In particular, the ARP Module keeps a table that maps physical addresses (MAC) to logical addresses (IP). The TCP Module stores the connection states in registers internal to the module. The ICMP Module builds the ICMP echo response packet to be sent back to the network.

The sub-modules are implemented as state machines that handle the packet sending and receiving processes and the operations performed as the machine evolves through its states.
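To make this concrete, the VHDL fragment below sketches the kind of receive state machine the MCM could use for the checks described above. All state and signal names (pkt_ready, version, csum_ok, and so on) are illustrative assumptions, not the authors' code, and the fragment belongs inside a larger architecture body.

  -- declarations go in the architecture's declarative part; the process in its body
  type mcm_state_t is (IDLE, READ_HEADER, CHECK_VERSION, CHECK_CSUM, DISPATCH, DISCARD);
  signal state : mcm_state_t := IDLE;

  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state <= IDLE;
      else
        case state is
          when IDLE          => if pkt_ready = '1' then state <= READ_HEADER; end if;
          when READ_HEADER   => state <= CHECK_VERSION;  -- fetch the IP header words from the shared RAM
          when CHECK_VERSION => if version = x"4" then state <= CHECK_CSUM; else state <= DISCARD; end if;
          when CHECK_CSUM    => if csum_ok = '1' then state <= DISPATCH; else state <= DISCARD; end if;
          when DISPATCH      => state <= IDLE;           -- hand the packet over to the ICMP, ARP or TCP module
          when DISCARD       => state <= IDLE;           -- invalid packet: drop it
        end case;
      end if;
    end if;
  end process;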

The TCP/IP Module architecture was described in VHDL. Some configuration parameters must be set before synthesis, which characterizes the customizability of the architecture. The configuration parameters are:

- the number of simultaneous TCP connections;
- the shared memory size that stores the packet being processed;
- the ARP table size, which defines the number of mappings between physical addresses (MAC) and logical addresses (IP);
- the IP address, gateway, physical address, and host subnet mask configuration.
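As an illustration of this compile-time customization, the sketch below shows how such parameters could be exposed as VHDL generics of the top-level entity. The generic and port names, widths, and default values are assumptions made for the example; the paper does not publish the actual interface.

  library ieee;
  use ieee.std_logic_1164.all;

  entity tcpip_module is
    generic (
      NUM_TCP_CONNECTIONS : natural := 10;     -- simultaneous connections tracked by the TCP module
      MEM_WORD_WIDTH      : natural := 16;     -- shared-memory word size: 16 or 32 bits
      SHARED_MEM_BYTES    : natural := 1524;   -- enough for one maximum-size Ethernet packet
      ARP_TABLE_ENTRIES   : natural := 8;      -- MAC <-> IP mappings kept by the ARP module
      HOST_IP             : std_logic_vector(31 downto 0) := x"C0A80002";
      GATEWAY_IP          : std_logic_vector(31 downto 0) := x"C0A80001";
      SUBNET_MASK         : std_logic_vector(31 downto 0) := x"FFFFFF00";
      HOST_MAC            : std_logic_vector(47 downto 0) := x"000A35010203"
    );
    port (
      clk, rst     : in  std_logic;
      -- handshake with the network interface that stores packets in the shared RAM
      pkt_ready    : in  std_logic;
      pkt_send     : out std_logic;
      -- shared RAM interface
      mem_addr     : out std_logic_vector(10 downto 0);  -- sized for the default buffer; would grow with SHARED_MEM_BYTES
      mem_wdata    : out std_logic_vector(MEM_WORD_WIDTH - 1 downto 0);
      mem_rdata    : in  std_logic_vector(MEM_WORD_WIDTH - 1 downto 0);
      mem_we       : out std_logic;
      -- application-layer interface: destination port and pointer into the TCP data area
      app_port     : out std_logic_vector(15 downto 0);
      app_data_ptr : out std_logic_vector(10 downto 0)
    );
  end entity tcpip_module;
  -- (architecture body with the MCM, TCP, ARP and ICMP sub-modules omitted)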

Concerning the number of connections that the TCP module should handle: this number is set by the user of the architecture before synthesis. The connection states are stored in internal registers; therefore, the area of the TCP Module is proportional to the number of simultaneous connections. Likewise, the area of the ARP module is proportional to the number of entries in the ARP table.
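A minimal sketch of the assumed storage structure makes the area scaling visible; the type, the state names, and the free-slot convention below are illustrative, not the authors' design.

  type tcp_conn_state_t is (CLOSED, LISTEN, SYN_RCVD, ESTABLISHED, FIN_WAIT, LAST_ACK);
  type tcp_conn_array_t is array (0 to NUM_TCP_CONNECTIONS - 1) of tcp_conn_state_t;
  signal conn_state : tcp_conn_array_t := (others => CLOSED);  -- one state register per configured connection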

B. TCP/IP Hardware Implementations

The network interface is responsible for assembling packets coming from the physical layer and storing them in the shared memory. The way packets are stored in memory, in terms of word size, defines the state machine structure of the TCP/IP Module architecture. Therefore, two different TCP/IP Module implementations were developed. The difference between them is the word size written to or read from memory (16 or 32 bits) and, consequently, the state machine descriptions. Both implementations were analyzed while varying the number of simultaneous TCP connections that the implementation is able to manage.

The number of TCP connections influences only the TCP module, since this module is the one responsible for storing and handling all the TCP connection information. The main effect is on the number of registers used to store the TCP connection states.

These implementations were described in VHDL and synthesized with the Xilinx ISE 8.1 tool for the Xilinx Virtex-II Pro XC2VP30-5ff896 FPGA. The number of lookup tables (LUTs), which are the smallest configurable logic elements used to implement the user's combinational logic inside the FPGA, and the number of flip-flops (FFs) were counted to evaluate the area used by the proposed implementations. The estimated maximum frequency of each sub-module and of the entire TCP/IP Module was also evaluated. The results are presented in Tab. 1.

By analyzing the algorithm that implements the TCP/IP stack, one can observe that most of the data fetched from memory have a 32-bit word size, such as the network address (32 bits). Consequently, when 32-bit data in the shared memory are used, the complexity of the state machine can be reduced, because all the processing is then performed on 32-bit words. However, 32-bit operators, such as 32-bit adders, decrease the operating frequency compared with the 16-bit implementation. Thus, even the reduction in state machine complexity, in terms of the number of states, cannot compensate for the increase in operator delay.

Figure 3 – Internal architecture of the TCP/IP Module: the Main Control Module, the TCP, ARP, and ICMP modules, the shared RAM interface, and the application-layer interface, connected by control and data buses


Table 1: Synthesis results of the two architecture implementations

                        16 bits                               32 bits
             MCM    ICMP   ARP    TCP    TOTAL     MCM    ICMP   ARP    TCP    TOTAL

10 TCP connections
LUTs         758    208    836    6035   7936      728    165    672    5873   7551
FFs          248    113    385    2556   3346      181    56     302    2491   3015
Fmax (MHz)   77.5   207.7  222.8  70.9   71.2      59.8   414.3  159.3  58.7   58.7

5 TCP connections
LUTs         758    208    836    4565   6459      728    165    672    4074   5735
FFs          248    113    385    1578   2368      181    56     302    1527   2047
Fmax (MHz)   77.5   207.7  222.8  70.9   71.2      59.8   414.3  159.3  58.7   58.7

1 TCP connection
LUTs         758    208    836    2555   4480      728    165    672    2278   3968
FFs          248    113    385    758    1547      181    56     302    648    1172
Fmax (MHz)   77.5   207.7  222.8  70.9   71.2      59.8   414.3  159.3  58.7   58.7

One can see that the frequency bottleneck is always the TCP module, and this result does not change with the number of TCP connections (10, 5, or 1). The reason for this low frequency is the complexity of the checksum operation performed in the MCM and TCP modules. The number of TCP connections influences only the area of the TCP module, as seen in Tab. 1.

Therefore, the entire TCP/IP Module must operate at the frequency limited by the TCP module. In summary, the 16-bit version operates at 71.2 MHz, while the 32-bit version operates at 58.7 MHz.
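The checksum in question is the standard Internet checksum: a one's-complement sum of 16-bit words. The entity below is a behavioral sketch of that accumulation, one word per clock, written for the 16-bit case; the interface names are assumptions, and it is not the authors' implementation.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity ones_complement_sum is
    port (
      clk        : in  std_logic;
      clear      : in  std_logic;                      -- restart the sum for a new packet
      word_valid : in  std_logic;                      -- one 16-bit word accepted per cycle
      data_word  : in  std_logic_vector(15 downto 0);
      checksum   : out std_logic_vector(15 downto 0)   -- one's complement of the running sum
    );
  end entity ones_complement_sum;

  architecture rtl of ones_complement_sum is
    signal acc : unsigned(15 downto 0) := (others => '0');
  begin
    process (clk)
      variable sum : unsigned(16 downto 0);
    begin
      if rising_edge(clk) then
        if clear = '1' then
          acc <= (others => '0');
        elsif word_valid = '1' then
          sum := ('0' & acc) + ('0' & unsigned(data_word));
          if sum(16) = '1' then
            acc <= sum(15 downto 0) + 1;               -- fold the end-around carry back in
          else
            acc <= sum(15 downto 0);
          end if;
        end if;
      end if;
    end process;

    -- when a received packet (including its checksum field) is summed, this is x"0000" if the packet is valid
    checksum <= std_logic_vector(not acc);
  end architecture rtl;

Widening this adder to 32 bits, as in the 32-bit implementation, lengthens the carry chain, which is consistent with the lower Fmax reported in Tab. 1.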

C. Experiments

We developed a network-flow simulation environment in order to perform a functional simulation of the TCP/IP Module and verify the performance of the implementations in case-study applications. This environment is composed of the TCP/IP Module described above, a packet generator module, and a third module that represents the shared memory area where the packet being processed is stored. Fig. 4 shows the block diagram of this environment.

Figure 4 – Block diagram of the simulation environment: the Packet Generator and the TCP/IP Module connected to a shared RAM (1524 bytes) through a 16/32-bit memory interface

In this simulation environment, the Packet Generator holds a set of packets with well-known behavior. The network-flow simulation begins when the first packet is generated by the Packet Generator and stored by it in the shared memory. The TCP/IP Module is then triggered by the Packet Generator, which informs it that there is a packet to be processed. The TCP/IP Module performs a set of verifications on the packet, such as the destination network address, the packet version, and the checksum. At the end of the processing, if a packet was generated by the TCP/IP Module and should be dispatched, the module emits a command signal with this information. In this environment the actual packet transmission is not modeled, and its implementation in a real network environment is straightforward.
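The sketch below illustrates, under assumed signal names (pkt_ready, pkt_send, matching the earlier entity sketch), how such a functional simulation can time one packet: the stimulus asserts pkt_ready after the packet image is assumed to be in the shared RAM model and measures the interval until the module's send command. The real packet generator and the RAM model are not reproduced here.

  library ieee;
  use ieee.std_logic_1164.all;

  entity tb_packet_flow is
  end entity tb_packet_flow;

  architecture sim of tb_packet_flow is
    signal clk       : std_logic := '0';
    signal pkt_ready : std_logic := '0';
    signal pkt_send  : std_logic := '0';   -- would be driven by the TCP/IP Module
  begin
    clk <= not clk after 7 ns;             -- close to the 71.2 MHz of the 16-bit version

    -- dut : entity work.tcpip_module generic map (...) port map (...);
    -- (instantiation and the shared-RAM model are omitted in this sketch)

    measure : process
      variable t0 : time;
    begin
      wait for 100 ns;                     -- assume the packet generator has filled the shared RAM
      pkt_ready <= '1';                    -- tell the module a packet is waiting
      t0 := now;
      wait until pkt_send = '1' for 10 us; -- timeout keeps the sketch from hanging without a DUT
      report "computing time = " & time'image(now - t0);
      wait;
    end process;
  end architecture sim;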

The Packet Generator is able to generate a set of four packets, which define a small packet flow such as occurs in a real network environment:

- ARP_Reply (60 bytes) – ARP packet sent in response to an ARP_Request;
- ICMP_Echo (74 bytes) – ICMP echo request packet;
- Syn (62 bytes) – connection request (SYN) to a TCP service;
- Syn_ack (60 bytes) – confirmation (SYN + ACK) of a connection request to a TCP service.

The implementations of the proposed architecture were analyzed with respect to the computing time of these packets. Tab. 2 shows the results obtained from the packet-flow simulation.

As can be observed in Tab. 2, the 16-bit description gives better results than the 32-bit description with respect to the computing time of the four packets that simulate a real network flow. The Syn packet was the only one whose computing time was affected by the number of TCP connections. This occurs because the Syn packet, the first packet used to establish a connection, requires inspecting every register that stores a connection state. Consequently, the computation time of this packet is proportional to the number of TCP connections.
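A fragment sketch of such a scan inside the TCP module's state machine is shown below (state and signal names are assumed, following the earlier connection-state array sketch); checking one connection register per cycle is what makes the Syn computing time grow with the configured number of connections.

  when SCAN_CONNECTIONS =>
    if conn_state(scan_idx) = CLOSED then
      alloc_idx <= scan_idx;                      -- free slot found for the new connection
      tcp_state <= BUILD_SYN_ACK;
    elsif scan_idx = NUM_TCP_CONNECTIONS - 1 then
      tcp_state <= DISCARD_PACKET;                -- every slot busy: the request is dropped
    else
      scan_idx <= scan_idx + 1;                   -- one connection register checked per cycle
    end if;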

A comparative analysis of the results of the two experiments detailed in Tab. 1 and Tab. 2 reveals that the circuits generated from the 16-bit description occupy about 12% more area, but their computing time is about 18% lower than that of the 32-bit description (these rates were averaged over the three configurations of the number of TCP connections). Based on these results, the choice of architecture will depend on the restrictions and requirements of the application. If the application constraint is circuit area, the best option is the 32-bit description; if there is a performance requirement, the 16-bit description is the best choice.



Table 2 – Results of packet computing time (µs)

                          16 bits                                    32 bits
TCP connections   ARP_Reply  ICMP_Echo  Syn      Syn_ack      ARP_Reply  ICMP_Echo  Syn      Syn_ack
10                0.2031     0.5252     1.3515   0.7073       0.2706     0.6061     1.6234   0.7251
5                 0.2031     0.5252     1.2815   0.7073       0.2706     0.6061     1.5152   0.7251
1                 0.2031     0.5252     1.2254   0.7073       0.2706     0.6061     1.4286   0.7251

V. THE HARDWARE/SOFTWARE COMMUNICATION INTERFACE

The TCP/IP Module architecture, as shown in Section IV, has a communication interface with the application layer. Some information about the TCP connection must be passed to the service being executed in the application layer. Through this interface, the TCP/IP Module makes available data such as the port number (which identifies the service to be invoked) and the pointer to the data area of the TCP packet.

The software layer also initializes the TCP/IP Module; moreover, it synchronizes the incoming and outgoing packets and invokes the corresponding service.

For example, in a web server, communication between the TCP/IP Module and the service running on the microprocessor takes place when there is a page request (HTTP GET). When that happens, the service has access to the packet data area, since the packet is stored in the shared memory. The software service transfers the requested page content and returns control to the TCP/IP Module, which fills in the packet header and calculates the checksum for the packet to be sent. Once the processing is finished, the TCP/IP Module invokes the device driver to allow the packet to be sent through the network interface.

For the implementation of the application layer in software and its synchronization with the TCP/IP Module in hardware, the Xilinx Virtex-II Pro device [12] is used. This device has an embedded PowerPC processor that can be used for this purpose. The designed HW/SW architecture uses the device drivers made available by the communication API of the Xilinx environment to manage the incoming and outgoing packets. The Xilinx EDK 8.1 tool [13] was used to integrate the hardware and software modules. It exploits features of the development board, such as device drivers and the insertion of link-layer management modules, among other functionalities.

The HW/SW architecture, with all the platform components used, is shown in Fig. 5.

The BRAM_BLOCK in the architecture shown in Fig. 5 is the shared memory block used to store the packet.

The platform used allows the addition of an interface to an external SRAM memory. This feature could be used to implement a buffer that stores packets in a queue, for example. In the current version this functionality is not implemented, and the memory block used is only large enough to store one maximum-size Ethernet packet (1500 bytes). Some studies show that the use of jumbo packets (9000 bytes) increases the processing throughput of a network channel [14]. This kind of packet can be used with the proposed architecture; for that, before synthesizing the architecture description, the parameter corresponding to the size of the shared memory that stores the packet must be configured.
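Assuming the parameterization sketched earlier (the generic names are illustrative, not the authors'), switching to jumbo packets would amount to changing the shared-memory size generic at instantiation time, for example:

  u_tcpip : entity work.tcpip_module
    generic map (
      MEM_WORD_WIDTH   => 32,
      SHARED_MEM_BYTES => 9018           -- illustrative value: jumbo payload plus headers
    )
    port map (
      clk => clk, rst => rst, pkt_ready => pkt_ready, pkt_send => pkt_send,
      mem_addr => mem_addr, mem_wdata => mem_wdata, mem_rdata => mem_rdata,
      mem_we => mem_we, app_port => app_port, app_data_ptr => app_data_ptr
    );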

Finally, considering embedded systems, the platform used here is similar to typical embedded-system platforms, so the proposed architecture could easily be adapted for use in embedded systems. Such systems need their CPU for executing their applications and cannot afford to have it occupied with TCP/IP stack processing. Offloading this processing increases the throughput and enables the use of specialized Internet services, such as iSCSI [15] and RDMA [16]. Another interesting characteristic of the proposed architecture is that it does not require much memory to store the packet, which matters because embedded systems have memory restrictions.

Figure 5 – The HW/SW architecture on the Xilinx Virtex-II Pro platform: PowerPC (PPC405), PLB2OPB_BRIDGE, JTAG_PPC_CNTLR, BRAM_BLOCK, PLB_BRAM_IF_CNTLR, PLB_ETHERNET, TCP/IP Module, OPB_UARTLITE, OPB_EMC, and external SRAM


VI. CONCLUSIONS AND FUTURE WORKS

This work has analyzed the influence of the shared-memory data size and of the number of TCP connections on the performance of hardware TCP/IP architectures for high-throughput Internet services in reconfigurable devices. The proposed architecture is composed of a specific module to process the packets of the network and transport layers. The dedicated hardware performs all the processing of these protocols; it improves the throughput and frees the CPU from the TCP/IP stack processing that would otherwise be done in software.

Two versions of the architecture, one with 16 bits and the other with 32 bits, were implemented and synthesized, differing in the state machine structure that defines the behavior of the sub-modules, and with three different configurations of the number of TCP connections. The obtained results were analyzed; the choice of which description to use will depend on the restrictions and requirements of the target application.

A preliminary hardware/software interface for implementing specialized services on top of the TCP/IP stack was also presented.

As future work, we plan a complete validation of the proposed architecture and its integration with an Ethernet network interface in order to validate it in a real network environment. After the validation of the TCP/IP Module in a real network, a study of the gain of the current approach relative to software processing will be carried out, so that we will be able to demonstrate the advantages of the architecture presented in this work.

REFERENCES

[1] WANG, W., WANG, J., LI, J., "Study on Enhanced Strategies for TCP/IP Offload Engines," ICPADS'05: Proceedings of the 11th International Conference on Parallel and Distributed Systems, pp. 398-404, 2005.
[2] CLARK, D., JACOBSON, V., et al., "An analysis of TCP processing overhead," IEEE Comm., vol. 27, no. 6, pp. 23-29, 1989.
[3] MARKATOS, E., "Speeding up TCP/IP: faster processors are not enough," in 21st IEEE Int. Perf., Comput., and Comm. Conf., pp. 341-345, 2002.
[4] STEENKISTE, P., "Design, implementation and evaluation of a single-copy protocol stack," Software - Practice and Experience, vol. 28, no. 7, pp. 749-772, 1998.
[5] JACOBSON, V., "4BSD header prediction," ACM Comput. Comm. Rev., vol. 20, no. 1, pp. 13-15, 1990.
[6] CLARK, D., "Modularity and efficiency in protocol implementation," RFC 817, 1982.
[7] FINN, G., HOTZ, S., VAN METER, R., "The impact of a zero-scan Internet checksumming mechanism," ACM Comput. Comm. Rev., vol. 26, no. 5, pp. 27-39, 1996.
[8] KLEINPASTE, K., STEENKISTE, P., ZILL, B., "Software support for outboard buffering and checksumming," in Proc. of the ACM SIGCOMM'95 Conf. on App., Tech., Archit., and Protocols for Comput. Comm., pp. 87-98, 1995.
[9] DOLLAS, A., ERMIS, I., KOIDIS, I., ZISIS, I., KACHRIS, C., "An Open TCP/IP Core for Reconfigurable Logic," Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'05), pp. 297-298, 2005.
[10] BOKAI, Z., CHENGYE, Y., "TCP/IP Offload Engine (TOE) for an SOC System," Institute of Computer & Communication Engineering, National Cheng Kung University, 2005.
[11] KANT, K., "TCP offload performance for front-end servers," Global Telecommunications Conference, vol. 6, pp. 3242-3247, 2003.
[12] Virtex-II Pro Platform FPGA Handbook, http://www.xilinx.com.
[13] Platform Studio and EDK, http://www.xilinx.com.
[14] DYKSTRA, P., "Gigabit Ethernet Jumbo Frames," White Paper, WareOnEarth Communications, 1999.
[15] CLARK, T., IP SANs: A Guide to iSCSI, iFCP and FCIP Protocols for Storage Area Networks, Addison-Wesley, 2001.
[16] ROMANOW, A., BAILEY, S., "An Overview of RDMA over IP," The Internet Society, 2002.
[17] SENAPATHI, S., HERNANDEZ, R., "Introduction to TCP Offload Engines," Network and Communications, Power Solutions, 2004.
