

1720 Cheng et al. / Front Inform Technol Electron Eng 2017 18(11):1720-1731

Frontiers of Information Technology & Electronic Engineering

www.jzus.zju.edu.cn; engineering.cae.cn; www.springerlink.com

ISSN 2095-9184 (print); ISSN 2095-9230 (online)

E-mail: [email protected]

Real-time pre-processing system with hardware accelerator for mobile core networks∗#

Mian CHENG, Jin-shu SU‡, Jing XU (College of Computer, National University of Defense Technology, Changsha 410073, China)

E-mail: [email protected]; [email protected]; [email protected]

Received Aug. 2, 2017; Revision accepted Sept. 30, 2017; Crosschecked Nov. 23, 2017

Abstract: With the rapidly increasing number of mobile devices being used as essential terminals or platforms for communication, security threats now target the whole telecommunication infrastructure and become increasingly serious. Network probing tools, which are deployed as a bypass device at a mobile core network gateway, can collect and analyze all the traffic for security detection. However, due to the ever-increasing link speed, it is of vital importance to offload the processing pressure of the detection system. In this paper, we design and evaluate a real-time pre-processing system, which includes a hardware accelerator and a multi-core processor. The implemented prototype can quickly restore each encapsulated packet and effectively distribute traffic to multiple back-end detection systems. We demonstrate the prototype in a well-deployed network environment with large volumes of real data. Experimental results show that our system can achieve at least 18 Gb/s with no packet loss with all kinds of communication protocols.

Key words: Mobile network; Real-time processing; Hardware acceleration
https://doi.org/10.1631/FITEE.1700507          CLC number: TP309.2

1 Introduction

The past decade has witnessed huge growth in the popularity of mobile devices, but this creates significant network security issues. Security incidents that previously occurred only in conventional networks are now occurring in mobile communication networks, such as explosive worms, viruses, and distributed denial-of-service (DDoS) attacks. The openness of the mobile network makes application developers and interactive business more accessible to core networks and databases. Thus, effectively monitoring and processing security incidents that occur on the mobile Internet is an important challenge

‡ Corresponding author
* Project supported by the National High-Tech R&D Program (863) of China (No. 2012AA013002)
# A preliminary version was presented at the 6th International Conference on Instrumentation, Measurement, Computer, Communication and Control, July 21–23, 2016, China

ORCID: Jin-shu SU, http://orcid.org/0000-0001-9273-616X
© Zhejiang University and Springer-Verlag GmbH Germany 2017

for network managers.

Real-time collection of network traffic is important for network service providers. In addition to monitoring malicious behaviors, collected traffic can be used to predict possible traffic conditions and optimize the network environment through selection of optimal routing paths and configuration of load-balancing policies. When connecting to a high-speed link, most detection systems use server cluster technology that collaboratively analyzes a traffic stream without sacrificing detection accuracy and must resort to specialized hardware for front-end traffic distribution to a set of back-end servers (Vallentin et al., 2007). However, the size of the cluster will be very large and incur greater costs when transmission speeds are improved. Although many researchers focus on the optimization of the packet capture engine through software-based solutions (Vasiliadis et al., 2011; Rizzo, 2012; Cisco Systems, Inc., 2013) or hardware acceleration (Han et al., 2010; Intel


Products, Inc., 2010; Peemen et al., 2013; Kekely et al., 2014), these approaches are still limited to only a few common protocols (e.g., HTTP) and cannot satisfy the requirements of mobile networks.

Different access networks are consolidated to connect the radio access network and the traditional Internet in the mobile core network. The 3GPP2 standard (China Communications Standards Association, Inc., 2006) sets several packet encapsulation protocols (e.g., GPRS tunneling protocol (GTP) and generic routing encapsulation (GRE)) and compression structures for packet transmission optimization. The packet structure determines the complexity of the process, and advanced analysis of the message content (e.g., deep packet inspection) consumes much of the processing capability. With the constant increase in network bandwidth, high-performance real-time collection of network traffic in a limited time and without loss is still a challenge.

In Cheng et al. (2016), we presented a simplified system for mobile core network measurement, which sketchily introduced a two-level pre-processing mechanism for collecting and distributing packets. However, it was not well established. To further refine our research, in this paper, we propose a real-time pre-processing system with a hardware accelerator for mobile core networks. The implemented prototype can quickly process each encapsulated packet and effectively distribute the restored packet to back-end servers. Fig. 1 gives the overview of our implemented pre-processing system architecture. The packet processing procedure has two components. First, the hardware accelerator, based on a field-programmable gate array (FPGA), performs processing which has a simple calculation procedure but consumes plenty of computing resources, such as packet decapsulation and character transformation of PPP frames in the CDMA2000 core network. Then for those operations that require significant resource overhead, such as packet decompression and recombination, we use a multi-core processor to significantly improve the processing capacity in a high-speed mobile core network. The hardware accelerator uses two distribution approaches to satisfy different packet distribution requirements from back-end servers. In most cases, we use the five-tuple in each packet, and flows are split roughly equally across equal length paths. If a suspicious user must be tracked, a distribution approach based on the user's

terminal address is proposed to ensure the integrity of the bidirectional data stream. The main idea is the cooperation between the hardware accelerator and the multi-core processor to identify the user terminal IP address of each data packet. Experiments on a real dataset demonstrate that our processing prototype can provide high throughput and can be applied to all kinds of International Telecommunication Union (ITU) standards, including CDMA2000, WCDMA, TD-SCDMA, and LTE.

Fig. 1 Structure of the real-time pre-processing system
(components shown: controller with IPMB/EPLD, multi-core processor with 4×DDR3, local bus, hardware accelerator with TCAM and SRAM, 8×XFI and 24×SFP+ interfaces carrying real-time streams to the back-end systems, and RJ-45/COM management ports)

2 Related work

In this section, we list some packet processing engines. Section 2.1 introduces the software-based solutions, which have achieved significant performance improvements in packet capture compared with the Linux native approach. Section 2.2 summarizes the hardware acceleration approaches, which give us a better way to implement packet processing. Section 2.3 points out the advantages of the hardware method and proposes our research direction.

2.1 Software-based solutions

Software-based solutions often focus on parallel processing or optimization of the processing architecture based on the traditional CPU. In Rizzo (2012), NETMAP was proposed to optimize the packet capture engine. It adopts batch processing and pre-allocates a fixed cache space (2048 bytes) in the initialization phase. NETMAP implements memory-mapped technology to allow applications to directly access the metadata structures in the kernel package, which is called NETMAP-RING. Note that the receipt and delivery of each receive side scaling (RSS) queue can use the NETMAP-RING to directly achieve parallel data channels. On the other hand, conventional computing platforms usually use an inefficient raw socket or a packet capture (PCAP)


interface to transmit packets to the user application level. MIDeA (Vasiliadis et al., 2011) adopts PF_RING to quickly pass the packet to the user layer space and runs the Snort intrusion detection system on a common multi-core computing platform. Experimental results show that the system can achieve 5.2 Gb/s network packet processing capability by using the multi-queue and GPU accelerator technologies. After receiving the packet, MIDeA invokes Snort's decoding engine to process the content of the packet, and then passes packets to the GPU for intrusion detection. The data packet distribution tool in MIDeA is a hardware component. Compared with the pipelined model composed of traditional processing cores, the hardware solution is more efficient and the whole system has better scalability.

In a mobile core network, some access servers are implemented by software. For instance, the Cisco ASR 5000 Series packet data serving node (PDSN) product's (Cisco Systems, Inc., 2013) packet processing card is implemented by a 2.5-GHz, quad-core x86 architecture processor and 16 GB of RAM, supporting a maximum of 2 million PDSN sessions, with a maximum processing capacity of 5 Gb/s. The multi-core processor can be seen as an integrated multiple-core system on chip (SoC). Each core is a separate operation unit, and has its own separate instruction, data cache, and operating system scheduler. We can execute each core's business concurrently without disturbing other units and set up multiple threads on each core.

2.2 Hardware acceleration

With the rapid growth of network bandwidth, the performance of software-based solutions became unable to satisfy line-speed packet collection requirements. Hardware acceleration based on FPGAs, application-specific integrated circuits (ASICs), and graphics processing units (GPUs) is of interest to system developers. The main idea of these commodity components is to offload a part of the CPU packet processing functions to the hardware to improve performance.

In the field of custom chips, Intel proposed a high-speed packet processing platform, Crystal Forest (Intel Products, Inc., 2010). The structure of the 'multi-core CPU + dedicated accelerator' uses an ASIC to implement the acceleration function. It can effectively undertake some complicated packet

processing functions, offload the pressure of the multi-core processor, and improve overall processing efficiency. However, ASICs' shortcomings include expensive customization and lack of scalability. On the other hand, many studies have recently experimented with the GPU to accelerate packet processing in network applications. This provides a significant performance boost when compared to the CPU-only solution. Han et al. (2010) proposed PacketShader, which adopts a heterogeneous packet processing architecture of 'CPU+GPU' to accelerate packet processing. This approach offloads the packet protocol processing functions, including IPv4 and IPv6, to the GPU. The CPU is responsible only for receiving and sending the packets. Cavigelli et al. (2015) presented a convolutional network accelerator that is scalable to network sizes that are currently handled only by workstation GPUs, but remains within the power envelope of embedded systems. It can significantly improve the external memory bottleneck of previous architectures, is more area efficient than previously reported results, and comes with the lowest-ever reported power consumption when including I/O power and external memory. Recently, Go et al. (2017) discussed eight popular algorithms widely used in network applications and suggested employing the integrated GPU in recent accelerated processing unit (APU) platforms as a cost-effective packet processing accelerator. Results demonstrate that network applications based on APUs can achieve high performance (over 10 Gb/s) for many computation- and memory-intensive algorithms. However, the power consumption of a discrete GPU-based accelerator is too high, causing unnecessary waste.

FPGAs are more flexible and scalable than other network acceleration hardware. They can be adjusted according to different platform requirements, not only to satisfy high-speed Internet traffic processing needs, but also to reduce energy consumption. Kekely et al. (2014) proposed an optimization method called software defined monitoring (SDM), which is based on a configurable hardware accelerator implemented in an FPGA and some smart monitoring tasks running as software on a general CPU. SDM is an optimization of a flexible flow-based network traffic monitor that supports application protocol analysis. Lavasani et al. (2014) presented a method for accelerating server applications using a hybrid 'CPU+FPGA' architecture.


It processes request packets directly from the network and avoids the CPU in most cases. Fast-path and slow-path techniques were proposed to speculatively execute a hot path by slicing the application and generating the fast-path hardware accelerator. Neil and Liu (2014) introduced an event-driven neural network accelerator. It can be integrated into existing robotics or it can offload computationally expensive neural network tasks from the CPU. Researchers have also proposed FPGA-based accelerators to compute over a billion operations per input image (Peemen et al., 2013; Zhang et al., 2015).

2.3 Comparison

Subject to the constraints of CPU processing capacity, a software-only solution no longer satisfies the requirement of core network processing. In the above hardware acceleration approaches, FPGAs are less costly and more scalable than ASICs. In the meantime, FPGAs have better flexibility than GPUs. According to the packet characteristics in the mobile core network, we propose an FPGA + multi-core processor architecture to effectively process encapsulated and compressed packets.

3 System architecture

Fig. 2 depicts the architecture of the proposed real-time pre-processing system with a hardware accelerator. The system consists of a multi-core processor, a hardware accelerator, and peripheral units such as ternary content-addressable memory (TCAM) and static random access memory (SRAM). The hardware accelerator and multi-core processor are implemented on an Altera Stratix V GX FPGA and a Broadcom XLP432 processor, respectively. To reduce the processing time, we use a two-stage structure to handle packets from the mobile core network. Upon receiving a packet, the hardware accelerator performs the processing which has a simple calculation procedure but consumes plenty of computing resources, such as escape character decoding and packet decapsulation. Then the pre-processed packet is transferred to the multi-core processor for in-depth processing. The multi-core processor performs the complicated calculation procedures, such as fragment reassembly, decompression of Van Jacobson (VJ) compressed packets, and internal frame reassembly. A multi-threading approach and an MIPS64-based architecture in the processor guarantee high performance. When the packet feeds back to the hardware accelerator, the distribution based on pre-set rules is done by the hardware accelerator, TCAM, and SRAM. A standard RJ-45 and a COM connector are deployed to allow the administrator to control the system for rule issuing, system monitoring, and debugging.

3.1 Deployment scenario

The key components of the 3G and LTE networks and the deployment of our pre-processing system are illustrated in Fig. 3. In WCDMA and TD-SCDMA packet core networks, we deploy our system to process packets through all Gn links between the serving GPRS support node (SGSN) and the gateway GPRS support node (GGSN). From the Gn interface, we can analyze a large volume of user business data and control signaling such as PDP activate

Fig. 2 Block diagram of the proposed real-time pre-processing system with a hardware accelerator
(hardware accelerator: tuple extracting and result matching against TCAM & SRAM, GRE/GTP decapsulation, character unescape, packet reconstitution, and hash distribution of raw data to four output groups; multi-core processor: threads #1–#32 performing outer IP reassembly, IP packet construction, out-of-order reassembly, Van Jacobson decompression, internal frame reassembly, and a learning component for control packets)


signaling, route update information, location area update information, and mobile management information. In the CDMA2000 packet core network, we do the same thing through all A10/A11 links between the packet control function (PCF) and the PDSN. There are many differences between LTE and 3G in both network structure and wireless technology. We receive packets from the interfaces S1-MME, S11, and S1-U.

Fig. 3 Deployment in the 3G and LTE core network
(LTE: traffic from the UE/RAN passes through the eNodeB and the S1-MME, S11, and S1-U interfaces between the MME and SGW/PGW toward the Internet; 3G: the BTS/RNC, BSC/SGSN, and PDSN/GGSN are tapped on the A10&A11 and Gn links; our system sits as a bypass before the firewall and feeds the back-end server)

3.2 Hardware accelerator

To accelerate processing, the hardware accelerator performs processing which has a simple calculation procedure but consumes plenty of computing resources. The parallel architecture of the FPGA means it can act as an extremely effective offload engine to relieve CPU bottlenecks. Shifting critical code to the FPGA and running those algorithms using multiple streaming processes in the FPGA can provide overall acceleration of 10 times or even more over CPU-only solutions. As shown in Fig. 2, we first assign packet decapsulation and character transformation to the hardware accelerator for pre-processing and reconstitute a custom packet structure for direct interaction between the hardware accelerator and multi-core processor. After receiving the restored packet, several distribution strategies are ready to allocate each packet to the back-end servers.

3.2.1 Packet processing acceleration

Processing acceleration is the most critical function of our system. First, to filter the packets that contain no user information, we discard all non-GRE and non-GTP packets, which may account for nearly 50% of the total traffic before processing. Then we process different communication standards with different approaches. In the WCDMA, TD-SCDMA, and LTE core networks, the user packet is directly encapsulated in the GTP header and it is simple to decapsulate and extract the user information. Hence, the hardware accelerator will process all GTP data packets and directly forward the internal user packet to the back-end servers through a distribution strategy without any processing in the multi-core processor. In practice, we always need to manage a lot of traffic from different operators at the same time. Thus, the complete treatment of the GTP packet in the hardware accelerator can remove significant processing pressure from the multi-core processor, and provide better performance in processing complicated packets in CDMA2000.
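Conceptually, the GTP data path amounts to stripping the GTP-U header and forwarding the inner IP packet. The following host-side sketch illustrates this for a GTPv1-U G-PDU; the helper name and its minimal handling of optional fields are our own simplifications, not the paper's FPGA implementation:

```python
import struct

GTP_MSG_GPDU = 0xFF  # G-PDU: the message type that carries a user IP packet

def gtpu_decap(payload: bytes):
    """Strip a GTPv1-U header and return (teid, inner_packet), or None if the
    datagram is not a GTPv1 G-PDU. Hypothetical helper, not the FPGA logic."""
    if len(payload) < 8:
        return None
    flags, msg_type, length, teid = struct.unpack("!BBHI", payload[:8])
    if flags >> 5 != 1 or msg_type != GTP_MSG_GPDU:
        return None                      # not GTPv1, or a signalling message
    offset = 8
    if flags & 0x07:                     # E, S, or PN flag: 4 optional octets
        offset += 4
    # 'length' counts every octet after the mandatory 8-byte header
    return teid, payload[offset:8 + length]
```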

1. Character transformation

In the CDMA2000 EV-DO core network, the user packet is encapsulated into a PPP frame and transmitted through the GRE tunnel. One GRE packet may encapsulate multiple PPP frames and each frame may contain incomplete IP packets. To distinguish each individual PPP frame, RFC 1661 (Internet Society, Inc., 2014) defines that when 0x7e and 0x7d occur in the information field, they are converted to the 2-byte sequences 0x7d5e and 0x7d5d, respectively. Therefore, it is necessary to scan the entire PPP frame to complete the character transformation to restore the inside user packet. The conventional algorithm, which usually runs in software, is to process PPP packets sequentially and detect whether there is a special character 0x7e or 0x7d. Apparently, the overall scanning will generate large amounts of overhead (the time complexity is O(n), where n is the length of the PPP packet). Specifically, when the server receives many data streams, the processor has to spend great effort on scanning every byte, which is the biggest bottleneck restricting CPU processing performance. In addition, in a 2^n-bit CPU platform (n ≥ 3), the range of data width is {1, 2, …, 2^(n−3)} bytes. For example, a 32-bit or a 64-bit CPU can read-write only four or eight bytes, respectively. Therefore, we use the parallel feature of the FPGA to improve processing efficiency. In the hardware environment, the data width can be defined according to the availability of resources. Algorithm 1 proposes a 16-byte parallel approach on FPGA to improve the efficiency of the overall scanning.
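For reference, the conventional sequential scan can be sketched in software as below; this is an illustrative model of PPP octet-unstuffing, not the code of our system, and it assumes each 0x7d is followed by the escaped byte:

```python
def ppp_unescape(data: bytes) -> bytes:
    """Sequential O(n) PPP octet-unstuffing: 0x7d 0x5e -> 0x7e, 0x7d 0x5d -> 0x7d.
    Illustrative software model of the conventional per-byte scan."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x7D and i + 1 < len(data):
            out.append(data[i + 1] ^ 0x20)   # escaped byte: XOR with 0x20
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```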

Page 6: Real-timepre-processingsystemwithhardware ...Cheng et al. / Front Inform Technol Electron Eng 2017 18(11):1720-1731 1721 Products, Inc., 2010; Peemen et al., 2013; Kekely et al., 2014),

Cheng et al. / Front Inform Technol Electron Eng 2017 18(11):1720-1731 1725

As shown in Algorithm 1, we define Bc = (B0, B1, …, B15) as the current 16-byte stream from the input first in, first out (FIFO) buffer to the register, and Bp and Bn as the previous and next 16-byte streams. The FIFO in the hardware accelerator can hold at least one packet (assuming a maximum of 2048 bytes). In lines 1–11, the following conditions are processed. In line 1, if there is no 0x7d in Bc, we write the bytes into the data alignment component and read Bn into the register. If Bp ends with 0x7d (line 4), then the first byte needs to be transformed before writing to the data alignment component. Finally, if any other byte is 0x7d (line 7), we write the transformed (b + 1)th byte to the bth position and shift each remaining byte by one position. To reduce the logic complexity, each cycle deals with only one 0x7d, and if there are multiple 0x7d in Bc, we process them sequentially from right to left. After transformation, we need to remove the invalid bytes before writing to the FIFO. As shown in Fig. 4, the data alignment component is composed mainly of a padding state machine, a read buffer state machine, and two sets of registers (Buffer0 and Buffer1). The padding state machine receives data containing a 'blank' from the character transformation component, and then completes data alignment with Buffer0 and Buffer1. The read buffer state machine sequentially reads the padded data and writes it to the FIFO. Throughout this process, we configure the FIFO as 256×134 bits, of which two bits are used as flag bits and another four bits represent the number of invalid data bytes; the data width is 128 bits.

Algorithm 1 Parallel character transformation
Input: addr_ppp: the starting address of the PPP packet; invalid_len: the number of invalid bytes.
Output: len_ppp: the length of the PPP packet after decapsulating.
 1: if 0x7d ∉ Bc then
 2:   Bc → data alignment
 3: end if
 4: if Bp,16 = 0x7d then
 5:   Bc,1 ⊕ 0x20 and invalid_len + 1
 6: end if
 7: if Bc,b = 0x7d, 1 ≤ b ≤ 14 then
 8:   Bc,b+1 ⊕ 0x20 → Bc,b
 9:   shift (Bc,b+2, …, Bc,16) by one byte
10:   invalid_len + 1
11: end if
12: if Bc,15 = 0x7d then
13:   Bc,16 ⊕ 0x20 → Bc,15
14:   invalid_len + 1
15: end if

The performance of the algorithm is analyzed below. From the above analysis, we can see that the performance bottleneck is the character transformation component. For one channel of the FPGA, we calculate the theoretical processing throughput P as

P = Br / (N · (1/M)),  (1)

where M is the frequency of the FPGA, N is the number of clock cycles, and Br is the maximum number of bytes that can be processed in parallel in each cycle (i.e., 16 bytes). Because we can process only one escape character at a time, N = n + 1, where n is the number of 0x7d. In other words, the frequency of the appearance of 0x7d determines the processing performance. In practice, if we assume that 0x7d and 0x7e appear randomly, then the average expected value E of the number of 0x7d per word is 0.125. In general, when n = 0.125, P is about 20 Gb/s. Altogether we configure eight channels to run the above task, so it is easy to implement line-speed processing.
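Eq. (1) can be checked numerically. The paper does not state the FPGA clock frequency, so the 180 MHz below is an assumed value, chosen only to show how the quoted ~20 Gb/s per-channel figure arises:

```python
def channel_throughput_gbps(clock_hz, n_escapes, parallel_bytes=16):
    """Eq. (1): P = Br / (N * (1/M)) = Br * M / N, with N = n + 1 cycles
    per parallel word and Br expressed here in bits (bytes * 8)."""
    cycles = n_escapes + 1
    return parallel_bytes * 8 * clock_hz / cycles / 1e9

# For random bytes, 0x7d is expected 16 * (2/256) = 0.125 times per 16-byte word.
# At an assumed 180 MHz clock: 128 bits * 180e6 / 1.125 cycles = 20.48 Gb/s.
print(channel_throughput_gbps(180e6, 0.125))
```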

Fig. 4 Data alignment component block diagram
(a padding finite state machine receives 'blank' data through a selector, aligns it into Buffer0 and Buffer1 over 128-bit data paths with 6-bit control signals, and a read buffer finite state machine writes the aligned result to the FIFO)

2. Packet reconstitution

As we described above, we perform the decapsulation of GRE and GTP tunnel packets and the character transformation of the PPP packet. However, the encapsulation header actually contains some important information for association of the user session, such as the source and destination IP addresses, the GRE key, and the tunnel endpoint identifier (TEID) number. As depicted in Fig. 5, we add an extra header to transfer this information between the hardware accelerator and the multi-core processor. It reduces the processing pressure by decreasing the length of the packet while ensuring information integrity. Then the control packet will be directly transferred to the multi-core processor for gateway address learning, which we will describe in Section 3.3.2.
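The self-defined header can be modeled as a fixed-size record. The paper does not give exact field widths or ordering, so the layout below (four-byte fields for the source IP, destination IP, GRE key, and GRE sequence number, matching the fields shown in Fig. 5) is a hypothetical sketch:

```python
import struct

# Hypothetical layout; field widths and order are assumptions, not the paper's spec.
META_FMT = "!IIII"  # src IP, dst IP, GRE key, GRE seq_num (4 bytes each)

def pack_meta(src_ip, dst_ip, gre_key, gre_seq):
    """Build the metadata header that carries tunnel context from the
    hardware accelerator to the multi-core processor."""
    return struct.pack(META_FMT, src_ip, dst_ip, gre_key, gre_seq)

def unpack_meta(blob):
    """Split a reconstituted packet into (metadata fields, PPP frame)."""
    size = struct.calcsize(META_FMT)
    return struct.unpack(META_FMT, blob[:size]), blob[size:]
```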


3.2.2 Distribution strategy

In addition to packet processing acceleration, we combine the hardware accelerator with TCAM and SRAM to build a quick packet distribution mechanism. In practice, our system receives packets from multiple operator networks. Thus, when the distribution step begins, we define four groups to distribute packets from different mobile operators to different back-end servers. Each group has a corresponding output port list, which can be pre-configured. Moreover, our system provides three distribution approaches to allocate each packet to back-end servers. In most cases, we use the five-tuple in each packet, and flows will be split roughly equally across paths of equal length. If a specific user must be tracked, we propose a distribution approach based on the user's terminal IP address to ensure the integrity of bidirectional data. The multi-core processor restores the IP packet and extracts the PDSN/GGSN address. By identifying the location of the PDSN in the IP address field, the transmission direction of the packet can be uniquely determined. Then we can identify the terminal IP address from the internal IP packet based on the transmission direction. We use the gateway IP address to allocate the control packet because the back-end server needs to extract the international mobile subscriber identification (IMSI) number and associate it with the user data. Based on the above approach, the hardware accelerator calculates the hash function to match the corresponding back-end server, and then rewrites the destination MAC address of each packet.
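The five-tuple approach can be sketched as follows. The actual hash implemented in the hardware is not specified in the paper, so CRC32 is used here only as a stand-in; canonicalizing the tuple keeps both directions of a flow on the same back-end server:

```python
import zlib

def pick_backend(five_tuple, ports):
    """Choose an output port for a flow. CRC32 is a stand-in for the
    (unspecified) hardware hash; canonical ordering of the endpoints makes
    the choice direction-independent."""
    src, dst, sport, dport, proto = five_tuple
    if (src, sport) > (dst, dport):      # put the endpoints in canonical order
        src, dst, sport, dport = dst, src, dport, sport
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return ports[zlib.crc32(key) % len(ports)]
```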

3.3 Multi-core processor

With the rapid growth of traffic in mobile core networks, the traditional general-purpose CPU suffers many performance bottlenecks in the packet processing procedure and cannot meet the demands of a high-speed network environment. Thus, we implement

Fig. 5 Structure of self-defined packet transmission

some functions on a Broadcom XLP432, which is an SMP network processor consisting of 8 identical cores with 32 threads and a shared 8-MB L3 cache. After the hardware accelerator performs packet pre-processing, the multi-core processor performs further in-depth processing. Our goal is to extract and restore the IP packet in each PPP frame and reorganize the possible IP fragmentation. The whole procedure consists of outer IP reassembly, decompression of the VJ compressed packet, internal frame reassembly, and out-of-order packet reassembly. These operations have complex processing logic, so they can hardly be implemented on an FPGA.
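The four-stage procedure above can be sketched as a simple pipeline. The function names and the packet representation are ours; each stage is a stub that merely records the ordering described in the text:

```python
# Minimal sketch of the in-depth processing order on the multi-core
# processor; each stage is a placeholder that tags the packet so the
# stage ordering is visible.
def reassemble_outer_ip(pkt):
    pkt["stages"].append("outer_ip")   # 1. outer IP reassembly
    return pkt

def decompress_vj(pkt):
    pkt["stages"].append("vj")         # 2. VJ header decompression
    return pkt

def reassemble_internal(pkt):
    pkt["stages"].append("internal")   # 3. internal frame reassembly
    return pkt

def reorder(pkt):
    pkt["stages"].append("reorder")    # 4. out-of-order reassembly
    return pkt

def process(pkt):
    pkt = reassemble_outer_ip(pkt)
    if pkt.get("vj_compressed"):       # only VJ packets need step 2
        pkt = decompress_vj(pkt)
    pkt = reassemble_internal(pkt)
    return reorder(pkt)

p = process({"stages": [], "vj_compressed": True})
assert p["stages"] == ["outer_ip", "vj", "internal", "reorder"]
```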

In the multi-core processor, there are several packet capture engines based on zero-copy and multi-core technology. The fast messaging network (FMN) is an important part of the multi-core processor. As depicted in Fig. 6, the FMN station is the connection point between the various functional units of the processor and the FMN message ring. The multi-core processor connects the functional units through the FMN station. Each physical core of the multi-core CPU has a corresponding station on the FMN (e.g., the XLP432 has eight physical cores, so there are eight stations and one-to-one correspondence). Each station has eight buckets; each bucket has a global bucket ID, which is unique in the FMN message ring. If the station sends a message to a bucket, it will be placed in the corresponding bucket according to the purpose of the message, and each thread or interface will read the message from the bucket and process it. Based on the above technology, we first complete the reorganization of the outer IP fragment. Then for a packet that has a PPP fragment, we do reorganization and then decompress the VJ packet to restore the user data. Finally, we deal with the out-of-order packet. In the real link, we find that the

Fig. 6 Processing of the fast messaging network



outer IP fragment accounts for less than 3% of the total packet stream. However, the compressed VJ packet will account for more than 20%. Hence, the processing efficiency of VJ packets will greatly affect the processing performance of the whole system. In the next section, we introduce the procedure for decompressing the VJ packet.

3.3.1 Decompression strategy

Due to the limited bandwidth of the air link, information content is often transmitted using Van Jacobson TCP/IP header compression technology. With VJ compression, the TCP payload is encapsulated in the PPP protocol, and other information fields are omitted (including the source and destination IP addresses and the port number). However, without restoring the raw packet, it is difficult to monitor and analyze user traffic.

To take full advantage of the multi-core processor, we use 30 independent threads to process packets in parallel and maintain a TCP header flow table, which is divided into 30 equal partitions. As depicted in Fig. 6, the rightmost list contains the unique identity (source IP, destination IP, and GRE key) of the TCP connection, the next pointer, and the TCP header buffer pointer, pointing to the space for the different TCP connections, which contains 256 connection numbers. Due to these indexes, packets are split among the threads and do not access the table space of other threads, which means that the flow tables operate independently of each other. Meanwhile, different threads do not need a shared-variable lock mechanism, so there is no delay between threads. Each thread independently completes the decompression without any communication between threads. Therefore, the 30 message processing threads can be fully parallel.

When processing the PPP frame, the multi-core processor calculates the index value of the header table according to the source IP, destination IP, and GRE key of the tunnel packet (the session uses the GRE key to identify different PPP connections), and then obtains a unique TCP connection according to the connection number. For uncompressed TCP packets, we update the header of the TCP connection, modify the protocol field, and directly forward the packet. For compressed packets, the header is restored according to the header of the TCP connection.
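The partitioned, lock-free lookup can be sketched as follows. The paper gives the table key (source IP, destination IP, GRE key), the 30 partitions, and the 256 connection numbers; the hash and data layout below are our illustrative choices:

```python
NUM_THREADS = 30       # independent worker threads, one table partition each
CONNS_PER_PART = 256   # connection numbers per session (as in Fig. 6)

# One flow-table partition per thread. No locking is needed because a
# given (src, dst, GRE key) always hashes to the same partition, so each
# partition is only ever touched by its own thread.
flow_tables = [dict() for _ in range(NUM_THREADS)]

def partition_of(src_ip, dst_ip, gre_key):
    """Choose the thread/partition that owns this PPP session."""
    return hash((src_ip, dst_ip, gre_key)) % NUM_THREADS

def store_tcp_header(src_ip, dst_ip, gre_key, conn_id, header):
    """Remember the last full TCP/IP header seen for this connection."""
    part = flow_tables[partition_of(src_ip, dst_ip, gre_key)]
    part[(src_ip, dst_ip, gre_key, conn_id)] = header

def lookup_tcp_header(src_ip, dst_ip, gre_key, conn_id):
    """Fetch the cached header used to restore a VJ-compressed packet."""
    assert 0 <= conn_id < CONNS_PER_PART
    part = flow_tables[partition_of(src_ip, dst_ip, gre_key)]
    return part.get((src_ip, dst_ip, gre_key, conn_id))

store_tcp_header("1.1.1.1", "2.2.2.2", 99, 5, b"cached-header")
assert lookup_tcp_header("1.1.1.1", "2.2.2.2", 99, 5) == b"cached-header"
```

For an uncompressed packet the stored header is overwritten with the new one; for a compressed packet the cached header supplies the omitted fields.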

3.3.2 Gateway address learning

This is another function that involves cooperation between the hardware accelerator and the multi-core processor. In a CDMA2000 mobile core network, the PDSN serves as the bridge that connects the mobile core network and the TCP/IP Internet. Similarly, in a WCDMA/TD-SCDMA mobile core network, the above functional entity is called the GGSN. Each packet transferred through the mobile core network encapsulates the IP address of the PDSN/GGSN in its GRE/GTP header, and we call it the gateway address. In our system, we use the gateway address to achieve control packet distribution and to identify the user terminal IP address as the input key value of the distribution.

As depicted in Fig. 7, by identifying the gateway address, the transmission direction of each packet can be uniquely determined. Then we modify the source MAC address of each data packet by adding a flag bit to indicate the transmission direction. When the hardware accelerator receives the restored packet, the flag bit indicates the user's terminal address. For instance, if the flag bit indicates that it is an uplink packet, the source IP address is the user's terminal address. Meanwhile, the control packet is transmitted back to the hardware accelerator and distributed through the gateway address.
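The terminal-address rule reduces to a single branch on the direction flag. The flag encoding below is our assumption for the sketch; the paper states only that a flag bit in the source MAC marks the direction:

```python
UPLINK, DOWNLINK = 1, 0  # hypothetical encoding of the MAC flag bit

def terminal_ip(direction_flag, src_ip, dst_ip):
    """Pick the user's terminal address from the restored IP packet.

    Uplink traffic flows terminal -> Internet, so the source address
    is the terminal; downlink traffic is the mirror case."""
    return src_ip if direction_flag == UPLINK else dst_ip

assert terminal_ip(UPLINK, "10.1.2.3", "8.8.8.8") == "10.1.2.3"
assert terminal_ip(DOWNLINK, "8.8.8.8", "10.1.2.3") == "10.1.2.3"
```

Either direction yields the same terminal address for a bidirectional flow, which is what lets the distributor keep both directions on one back-end server.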

Fig. 7 Gateway address learning scenario

3.4 Available extension

Due to the flexibility and scalability of the multi-core processor and FPGA, our system can also provide serial access to the mobile core network for identifying, detecting, and forwarding multiple application protocols, such as VoIP, HTTP, and email. Our system helps discover unsafe content and malicious behaviors in the network, and plays an important role in spam detection, traffic statistics, and attack



behavior analysis. In this extension, collection, filtering, and forwarding can be implemented in the FPGA, and the TCAM is responsible for rule matching. The multi-core processor is responsible for protocol identification and traffic management. Due to space limitations, we sketch only the modifications, and leave the detailed descriptions to another paper.

Fig. 8 depicts the overview of the proposed extension scheme of our system. When a packet is received, the manager component sets the network segment that needs to be focused on in the filtering and forwarding module. Then, if the traffic matches the rules, it will be copied and sent to the multi-core processor for the detection steps as follows:

Fig. 8 Structure of the extension scheme. The dotted lines indicate the transmission of internal information and instructions; the solid and hollow arrows indicate the transmission directions of the mirrored traffic and the attack traffic which needs to be cleaned, respectively

Step 1 (protocol analysis): The multi-core processor uses mirrored traffic to analyze TCP and UDP packets, manage traffic, and detect abnormal packets. When video stream data must be processed, the regular expression method is essential to match the key fields in the payload.

Step 2 (anomaly detection): This component mainly detects DDoS attacks through statistical analysis of the traffic address entropy. A continuous increase in source address entropy, or a reduction in destination address entropy, indicates that a DDoS attack may be occurring.

Step 3 (behavior analysis): This component involves statistical analysis of protocol behavior for suspected victim hosts, such as the number of TCP SYN and FIN packets in a specific time period.

Step 4 (collaborative detection): This component maintains a list of suspected victim hosts to determine if an attack occurred. It also collects and sends all the suspicious information about the current network to the system manager, such as the address of the victim host and the type of attack. Orders from the manager can also be received and sent to other components.

Step 5 (feature extraction): This component maintains normal traffic characteristics, such as fluctuations in the TTL range and the distribution of packet length and protocol, and assists in attack identification and traffic cleaning.

Step 6 (traffic cleaning): This component filters the attack traffic based on session status and normal/abnormal traffic characteristics.
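The address-entropy statistic used in Step 2 can be sketched concretely. This is a standard Shannon entropy over the addresses observed in a time window (the window size and thresholds are deployment choices the paper does not fix):

```python
import math
from collections import Counter

def address_entropy(addresses):
    """Shannon entropy (bits) of the address distribution in one window.

    Spoofed sources in a DDoS attack push source-address entropy up,
    while traffic converging on one victim pulls destination-address
    entropy down."""
    counts = Counter(addresses)
    total = len(addresses)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical windows: normal traffic spread over 50 destinations vs.
# attack traffic concentrated on a single victim.
normal_dst = ["10.0.0.%d" % (i % 50) for i in range(1000)]
attack_dst = ["10.0.0.7"] * 950 + ["10.0.0.%d" % i for i in range(50)]
assert address_entropy(attack_dst) < address_entropy(normal_dst)
```

An anomaly detector would track this value per window and raise an alert when it drifts past a configured threshold for several consecutive windows.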

Fig. 9 provides a snapshot of the implementation of our pre-processing system. At the core of the board is an Altera Stratix V GX FPGA, adjacent to an XLP432 multi-core processor produced by Broadcom. It can provide high-capacity traffic processing and distribution with 24×10 Gb/s POS input and output in standard SFP+ interfaces. Moreover, GE and COM interfaces have been set up to provide console, Telnet, and a dedicated remote configuration protocol (RCP) to control the entire system and perform the batch configuration of filter rules.

Fig. 9 Implementation of the processing prototype

4 Experiments

In this section, we evaluate the performance of the real-time pre-processing system on a real dataset. We obtained real data from a collector located in the Gn interface of a TD-SCDMA core network over



a period of 5 h from 11:00 a.m. to 4:00 p.m. in Hunan Province on April 21, 2011. We collected 2.69 billion packets, corresponding to 1.5 TB of TD-SCDMA traffic. Similarly, we collected data from the A10/A11 interface of a CDMA2000 core network in Hebei Province and obtained 45.2 million packets, corresponding to 435 GB of CDMA2000 traffic. Table 1 shows the statistical analysis of different types of PPP frames in the CDMA2000 dataset. We can observe that more than 70% of the GRE encapsulation packets contain only one PPP frame, including 30.99% of the complete IP packets, which can be directly forwarded, 13.97% of the VJ compressed packets, and 37.43% of the IP fragments. Furthermore, according to our statistical result, the VJ compressed packets in the core network can account for 23%–25% of the total number of packets after recombination of the fragments. The remaining 30% indicates a GRE packet that contains multiple PPP frames, and most of them involve at least one fragment (14.72% of the total number of packets).

Table 1 Proportion of each type of packet

Packet type        Percentage        Packet type       Percentage
Single complete    30.99% (IP)       Single piece      37.43%
                   13.97% (VJ)       Multi complete     2.53%
                    0.36% (others)   Multi piece       14.72%

In practice, due to the low utilization of the real core network, a realistic environment cannot provide a sufficiently large flow for our evaluation, so we used an evaluation tool to simulate the real mobile core network. Connected to a four-port optical splitter, we can generate a maximum input rate of 40 Gb/s. The experimental environment is shown in Fig. 10.

4.1 Performance evaluation among different types of packets

To assess performance more accurately, we chose three different types of packets from the CDMA2000 dataset to evaluate the performance bottleneck. Fig. 11 shows the packet loss of our system in the case of decompression of the 64-byte VJ compressed packet, the reorganization process of multiple internal fragments in one GRE packet, and the reorganization process of internal fragments across two different packets. The main reason for choosing these packets is their frequent appearance in mobile core networks. According to our statistics, VJ compressed packets and

internal fragments account for about 26% and 52% of overall traffic, respectively. The decompression procedure requires caching of the VJ uncompressed packet while the subsequent VJ compressed packet is decompressed. Because the existing packets are out of order, multiple internal fragments and internal fragments across two different packets consume a large number of resources while waiting for reorganization. In the experiments, we extracted three samples of the above packet types from the CDMA2000 dataset. Although each sample has only a few packets, we used IPRO to play them back 10^6 times to evaluate the performance of our system with or without the hardware accelerator. As depicted in Fig. 11, because the hardware accelerator accomplishes the processing of packet decapsulation and character transformation, the results indicate that the performance of our system improves by 27%, 57%, and 50% when dealing with the above three packet types.

Fig. 10 Demonstration environment

4.2 Performance evaluation of the overall system

We used IPRO to play back the overall CDMA2000 and TD-SCDMA traffic at different rates. To estimate the performance in the CDMA2000 core network, we ran an experiment on a typical server running Ubuntu Linux with a 3.5-GHz Intel Core i7 processor and 32 GB of memory to compare its performance with our system. Fig. 12a indicates the distribution of packet length of the CDMA2000 dataset. We can observe that almost 54% of packets were longer than 512 bytes and the average length of a packet was 612 bytes. As illustrated in Fig. 12b, when the incoming packet rate increases, the typical server starts to drop packets at 450 Mb/s, and the packet loss rate reaches 65% when the incoming packet rate reaches 1 Gb/s.



Fig. 11 Packet loss experiment on different types of packet: (a) 64-byte VJ; (b) multi-internal fragment; (c) cross fragment

Fig. 12 Results of system performance: (a)–(c) are the packet size distribution, packet loss rate, and throughput in the CDMA2000 core network, respectively; (d)–(f) are the packet size distribution, packet loss rate, and throughput in the TD-SCDMA core network, respectively

Similarly, the multi-core processor drops packets at 13 Gb/s, which is a good result for a common user network, but not enough for a mobile core network. When the incoming packet rate reaches 30 Gb/s, the multi-core processor has dropped more than 50% of the packets. Finally, because a large part of the workload is processed in the FPGA-based hardware accelerator, our system loses no packets even when the incoming packet rate reaches 18 Gb/s. In Fig. 12c, we evaluated the throughput for different packet lengths. It is obvious that the throughput increases as the size of the packet increases and that the processing performance with the hardware accelerator is much better than that of the multi-core processor-only solution. In Figs. 12d–12f, we used the same evaluation scenario on the TD-SCDMA core network dataset. First, we provided the distribution of packet length in Fig. 12d and found that almost 46% of packets have a length of less than 128 bytes and the average packet length is 572 bytes. Then, as shown in Figs. 12e and 12f, because the GTP packet was completely processed and forwarded by the hardware accelerator, our system was able to process line-speed traffic.



4.3 Performance evaluation of the extension

Because some functions have not been implemented, we evaluated only the performance of the protocol analysis component. We separated several POP3 sessions and video-on-demand (VOD) streams from the CDMA2000 dataset to build a new synthetic dataset. The whole dataset includes 13 POP3 sessions and 5 VOD streams, corresponding to 13 195 packets. The IPRO tester was used to perform 10^4 cycles of playback with randomly changing IP addresses. For comparison, the protocol analysis component was also implemented on a typical server as mentioned above. Fig. 13 evaluates the performance of our system with or without protocol analysis. The results show that there is no significant reduction in performance when processing the POP3 protocol. In contrast, due to the implementation of regular expressions, the multi-core processor is less efficient, which inevitably leads to a decline in the performance of our system. However, it still far exceeds the traditional network server. In the future, we intend to employ a separate accelerator chip, such as the Netlogic NLS2008, to achieve high-speed regular expression matching.

Fig. 13 Performance evaluation of the extension scheme

5 Conclusions

In this paper, we have proposed a real-time pre-processing system for mobile core networks. To reduce the processing pressure on back-end detection servers, an FPGA-based hardware accelerator and a multi-core processor were implemented to handle different stages of packet processing. Evaluation results showed that our system can achieve a speed of at least 18 Gb/s with no packet loss.

References

Cavigelli, L., Gschwend, D., Mayer, C., et al., 2015. Origami: a convolutional network accelerator. Proc. 25th Edition on Great Lakes Symp. on VLSI, p.199-204. https://doi.org/10.1145/2742060.2743766

Cheng, M., Sun, Y., Su, J., 2016. A real-time pre-processing system for mobile core network measurement. Proc. 6th Int. Conf. on Instrumentation, Measurement, Computer, Communication and Control, p.298-302. https://doi.org/10.1109/IMCCC.2016.73

China Communications Standards Association, Inc., 2006. Third Generation Partnership Project 2.

Cisco Systems, Inc., 2013. Cisco ASR 5000 Series Aggregation Services Router Installation and Administration Guide Version 10.0.

Go, Y., Jamshed, M.A., Moon, Y., et al., 2017. APUNet: revitalizing GPU as packet processing accelerator. Proc. USENIX Symp. on Networked Systems Design and Implementation, p.83-96.

Han, S., Jang, K., Park, K., et al., 2010. PacketShader: a GPU-accelerated software router. ACM SIGCOMM Comput. Commun. Rev., 40(4):195-206. https://doi.org/10.1145/1851182.1851207

Intel Products, Inc., 2010. Crystal Forest Platform: Product Overview.

Internet Society, Inc., 2014. Request for Comments No. 1661.

Kekely, L., Puš, V., Benáček, P., et al., 2014. Trade-offs and progressive adoption of FPGA acceleration in network traffic monitoring. Proc. 24th Int. Conf. on Field Programmable Logic and Applications, p.1-4. https://doi.org/10.1109/fpl.2014.6927443

Lavasani, M., Angepat, H., Chiou, D., 2014. An FPGA-based in-line accelerator for memcached. IEEE Comput. Archit. Lett., 13(2):57-60. https://doi.org/10.1109/l-ca.2013.17

Neil, D., Liu, S.C., 2014. Minitaur, an event-driven FPGA-based spiking network accelerator. IEEE Trans. VLSI Syst., 22(12):2621-2628. https://doi.org/10.1109/tvlsi.2013.2294916

Peemen, M., Setio, A.A., Mesman, B., et al., 2013. Memory-centric accelerator design for convolutional neural networks. Proc. IEEE 31st Int. Conf. on Computer Design, p.13-19. https://doi.org/10.1109/iccd.2013.6657019

Rizzo, L., 2012. Netmap: a novel framework for fast packet I/O. Proc. 21st USENIX Security Symp., p.101-112.

Vallentin, M., Sommer, R., Lee, J., et al., 2007. The NIDS cluster: scalable, stateful network intrusion detection on commodity hardware. Proc. Int. Workshop on Recent Advances in Intrusion Detection, p.107-126. https://doi.org/10.1007/978-3-540-74320-0_6

Vasiliadis, G., Polychronakis, M., Ioannidis, S., 2011. MIDeA: a multi-parallel intrusion detection architecture. Proc. 18th ACM Conf. on Computer and Communications Security, p.297-308. https://doi.org/10.1145/2046707.2046741

Zhang, C., Li, P., Sun, G., et al., 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, p.161-170. https://doi.org/10.1145/2684746.2689060