
FreeBSD on Freescale QorIQ Data Path Acceleration Architecture Devices

Michał Dubiel, Piotr Zięcik
Semihalf,

[email protected], [email protected]

Abstract

This paper describes the design and implementation of the FreeBSD operating system port for the QorIQ Data Path Acceleration Architecture, a family of communications microprocessors from Freescale. These chips are modern, multi-core, PowerPC-based SoCs featuring a number of specially designed peripherals aimed at the high-performance networking devices that are increasingly common in modern communication infrastructure.

The primary focus is the Data Path Acceleration Architecture (DPAA) with its new approach to network interface architecture, which has a significant influence on the design and implementation of the FreeBSD device drivers. The paper describes how full network functionality was brought up, and also covers other major development tasks such as the e500mc quad-core SMP bring-up and support for other integrated devices.

1 Introduction

The increasing number of services available through the world-wide network creates new challenges for the telecommunications industry. Bandwidth requirements are continuously growing, which demands more and more packet processing power. The situation is made harder still by security concerns: modern telecommunication systems have to perform real-time deep packet inspection and classification, and be immune to sophisticated attacks.

To cope with these problems, System-on-Chip (SoC) designers have been introducing dedicated architectures that combine sophisticated hardware modules taking over more and more processing responsibilities from the CPU cores. One example of such an architecture is the Freescale QorIQ Data Path Acceleration Architecture, which is the platform for this work.

In order to utilise the facilities brought by those architectures, new software is necessary. Advanced, secure and stable operating systems like FreeBSD have been available for years, so it seems natural to combine their strengths with the capabilities of the new hardware.

In this paper, a FreeBSD port for the Freescale QorIQ Data Path Acceleration Architecture SoCs is presented. Design and implementation details, the development process and the challenges encountered are all covered. An overview of the QorIQ DPAA SoC hardware is also given, to help the reader understand the design and implementation decisions made.

This work is based on the existing FreeBSD port for the Freescale PowerQUICC III series of SoCs ([E500BSD]). The result is a fully functional FreeBSD system running on Freescale QorIQ DPAA SoCs which makes use of the DPAA hardware components to achieve a significant network performance gain.

2 Hardware

Figure 1: QorIQ P3041 Communication Processor (Source: [P3041FS]).

The Freescale QorIQ Data Path Acceleration Architecture consists of up to eight PowerPC cores supporting virtualization, a security-enhanced interconnect fabric and several hardware components offloading the CPUs from packet processing. Figure 1 presents the internal structure of the P3041 SoC, a mid-range member of the QorIQ DPAA family. The following subsections describe the architecture in greater detail.

2.1 e500mc core

The e500mc core is the next generation of the e500 core, which is described in greater detail in [E500BSD]. An extensive description of the core is beyond the scope of this paper; only the major improvements introduced by the e500mc are presented here.

From the performance perspective, the e500mc core introduces an integrated L2 cache and an improved FPU. Furthermore, the TLB sizes have been increased: there are 512 fixed-size entries and 64 variable-size entries. The core also comes with decorated load/store instructions.

The decorated load/store instructions add meta-data (a decoration) to the basic load/store operations. This meta-data defines additional actions which may be performed during I/O transactions. For example, the value being stored can be added to the word already located in memory instead of just overwriting it. The QorIQ DPAA SoCs define several decoration actions providing basic arithmetic and logic operations as well as min/max and accumulation functions.
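To make these semantics concrete, the following minimal C model sketches what a decorated store does. It only models the behaviour; the real feature consists of dedicated CPU instructions operating on decorated storage, and the decoration set shown here is a simplified, illustrative subset.

#include <stdint.h>

/*
 * Software model of a decorated store. The decoration selects an
 * operation performed at the target location instead of a plain
 * overwrite. Illustration only; the hardware implements this with
 * dedicated load/store opcodes acting on decorated I/O space.
 */
enum decoration { DEC_NONE, DEC_ADD, DEC_MIN, DEC_MAX };

static void
decorated_store(volatile uint32_t *dst, uint32_t val, enum decoration dec)
{
	switch (dec) {
	case DEC_NONE:			/* classic store: overwrite */
		*dst = val;
		break;
	case DEC_ADD:			/* accumulate into the target */
		*dst += val;
		break;
	case DEC_MIN:			/* keep the minimum */
		if (val < *dst)
			*dst = val;
		break;
	case DEC_MAX:			/* keep the maximum */
		if (val > *dst)
			*dst = val;
		break;
	}
}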

Another new feature introduced by the e500mc core is an additional execution privilege level (directly above the supervisor level present in the classic PowerPC architecture), which extends the virtualization capabilities. A program running at this level is allowed to use another unique feature of the core: external PID load/store. These are special instructions which execute I/O requests in a different context (defined by the hypervisor) than the currently running one.

The core also provides better integration within the SoC. Power management of the core and of the SoC is now combined, which reduces the programming effort. Inter-core communication has been improved by adding message-send and message-clear instructions based on a new core-to-core doorbell interrupt.

The last but not least feature introduced by the e500mc core is cache stashing. With the assistance of the SoC interconnect, a selected part of any given device ↔ memory transfer can also be placed in the CPU's L2 cache. The stashing mechanism is targeted at accelerating software packet processing. For example, every time the Ethernet interface receives a TCP/IP packet, the first couple of bytes, containing the protocol headers, can be stored in the cache. During packet parsing the CPU will then fetch this data directly from the L2 cache instead of performing an expensive memory access.

2.2 CoreNet Interconnect Fabric

The e500mc cores and all major peripherals are connected by the CoreNet Interconnect Fabric, which represents a modern Network-on-Chip approach. Its design solves security problems introduced by virtualization. Guest operating systems are not able to access each other directly (due to protections built into the e500mc core); however, they are able to configure a hardware component (such as a DMA engine) to access all available memory, including areas occupied by other guest OSes. To avoid such a situation, called a DMA attack, a hypervisor normally virtualizes all accesses to the hardware. This simple solution has a significant performance impact, which can be avoided by using the Peripheral Access Management Units (PAMUs) integrated into the CoreNet interconnect.

2.2.1 Peripheral Access Management Unit

The Peripheral Access Management Unit (PAMU) is a hardware unit through which every I/O-to-memory transaction passes. Each transaction, characterised by a Logical I/O Device Number (LIODN) and an address, must match a specific set of rules defined by software, in a way very similar to MMU entries. The matching rule may allow, deny or redirect the transaction to another destination, as well as modify its cache coherency and stashing attributes. It is worth noting that the LIODN consists of two fields: one is hardwired to the requesting device, whereas the second may be generated by the hardware from data included in the request. For example, an Ethernet controller's DMA may assign specific LIODNs to packets received from given 802.1Q VLANs.
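The following hypothetical C sketch illustrates the shape of such a rule; the field layout is purely illustrative and does not follow the actual PAMU table format.

#include <stdint.h>

/* Hypothetical PAMU rule entry; layout is illustrative only. */
struct pamu_rule {
	uint16_t liodn_dev;	/* hardwired part: requesting device    */
	uint16_t liodn_dyn;	/* dynamic part, e.g. derived from VLAN */
	uint64_t window_base;	/* address window covered by the rule   */
	uint64_t window_size;
	uint8_t	 action;	/* allow / deny / redirect              */
	uint8_t	 coherent;	/* cache coherency attribute override   */
	uint8_t	 stash_cpu;	/* CPU cache to stash into, if any      */
};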

2.2.2 Platform Cache

Another interesting component of the QorIQ DPAA SoCs is the Platform Cache. Unlike the CPU caches, the Platform Cache is placed between the memory controller and the CoreNet interconnect. As a result, all requests to memory are accelerated, not only those generated by the processors. This reduces memory controller contention caused by small transfers, such as the reception and transmission of small Ethernet packets.

2.3 Data Path Acceleration Architecture

The Data Path Acceleration Architecture is the key component characterising the QorIQ DPAA SoC family. It provides an infrastructure for hardware-accelerated packet processing, such as bridging, routing, packet filtering and IPSec processing, to name just a few. The DPAA offloads the CPU in four areas: data buffer management, queue management, packet distribution, and policing.

Data buffer management is realised by a specialised component called the Buffer Manager (BMan). Data buffers can be allocated and deallocated (by software and by hardware components) from user-defined pools. The BMan automatically manages these pools, signalling a depletion state when the number of buffers in a pool falls below a predefined threshold level.

Queue management is realised by the Queue Manager (QMan). Data buffers (not necessarily allocated from the BMan) may be organised into frames containing one or more buffers, and then put into queues which are fully managed by the QMan. The queues provide an effective data exchange infrastructure between CPUs and/or DPAA components.

The Frame Manager (FMan) is responsible for packet distribution and policing. Each frame can be parsed and classified, and the results may be attached to the frame. This metadata may be used to select the particular QMan queue the packet is forwarded to, or to control flow bandwidth. The FMan supports both online (when packets arrive from the integrated MACs) and offline (when packets are taken from memory) processing.

Figure 2: Example of QorIQ Data Path Acceleration Architecture (Source: [DPAARM]).

Figure 2 presents the interactions between the DPAA components. Some of them are described in greater detail in the following sections.

2.3.1 Buffer and Queue Manager

As mentioned above, the BMan holds and manages pools of data buffers. Each buffer is represented only by a pointer. To allocate buffers, software just reads the pointers from a special set of registers (called portals). Deallocation is the opposite: software writes the pointers to the portal. There are up to ten portals, each of which can be assigned to a CPU core and/or application. As accesses to a portal are atomic, no locking is required as long as the portal is not shared. This mechanism eliminates lock contention, which greatly degrades performance in multi-core systems. The same access method applies to the Queue Manager: putting a frame into a queue is in fact writing a frame descriptor into a QMan portal.
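A minimal sketch of this access model is given below, assuming a memory-mapped portal with one acquire and one release register. The structure and function names are illustrative and do not correspond to the NetCommSw API.

#include <stdint.h>

/* Illustrative view of a BMan software portal. */
struct bman_portal {
	volatile uint64_t *acquire;	/* read pops a buffer pointer    */
	volatile uint64_t *release;	/* write pushes a buffer pointer */
};

/* Allocate one buffer from the pool; 0 means the pool is empty. */
static uint64_t
bman_buf_acquire(struct bman_portal *p)
{
	/*
	 * A single atomic register read: no lock is needed as long as
	 * this portal is not shared between cores.
	 */
	return (*p->acquire);
}

/* Return a buffer to the pool. */
static void
bman_buf_release(struct bman_portal *p, uint64_t buf)
{
	*p->release = buf;		/* single atomic register write */
}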

Portals are also able to signal various conditions by raising an interrupt. For example, a QMan portal asserts an interrupt when there are more than a predefined number of frames in the queue, or when a frame has been held in the queue over a specified time.

2.3.2 Frame Manager

The Frame Manager is the central and most complex part of the Data Path Acceleration Architecture. It is responsible for packet reception, transmission, parsing and classification, and finally for routing packets into the relevant QMan queues. Simply speaking, it is the heart of hardware packet processing. The above tasks are fulfilled by dedicated FMan sub-modules, described later in this subsection.

Several 1 Gbit/s and 10 Gbit/s Media Access Controllers (MACs) are integrated into the Frame Manager, each associated with a single physical Ethernet interface. Their number varies from one SoC version to another, but in most cases there are five 1 Gbit/s and one 10 Gbit/s interfaces. These controllers are directly responsible for packet transmission and reception over the media.

Whereas the FMan MACs connect the FMan to the network, the Buffer Manager Interface (BMI) and Queue Manager Interface (QMI) sub-modules provide the means of communication with the Buffer Manager and the Queue Manager respectively. These are necessary, as the full potential of the Data Path Acceleration Architecture can only be exploited when the BMan, QMan and FMan components all cooperate with each other.

The FMan MACs and the BMI and QMI sub-modules allow frames to be exchanged between the Frame Manager, the network and other SoC modules like the CPU cores, the Security Engine (SEC), etc. However, as mentioned earlier, the Frame Manager also allows for packet parsing, classification and policing. These features are provided by the FMan Controller, Parser, KeyGen, Policer and Frame Processing Manager (FPM).

The central unit responsible for hardware frame processing inside the FMan is the Frame Processing Manager. It distributes frame processing tasks amongst the other FMan sub-modules, where they are processed next.

The FMan Controller, Parser, KeyGen and Policer form the so-called Parse, Classify and Distribute (PCD) flow. The Parser is used to identify the incoming frames. It recognises many standard protocols and also allows for custom and/or future protocol parsing. Its output, called the parse result, is then used by the classification modules.

There are two categories of frame classification: hash based and table lookup. The former is realised by the FMan KeyGen sub-module. Hashing can be performed on many different fields of the frame and/or on the parse result. The latter, table lookup, is performed by the FMan Controller. It looks up certain fields of the frame, selected by a combination of user-specified configuration and the fields the FMan Parser actually encountered. The classification of a frame (either hash based, table lookup or both) determines the subsequent action to take.

An example of such an action might be frame policing, performed by the FMan Policer. It supports many policing profiles based on the two-rate, three-color marking algorithm ([RFC2698]). Burst sizes, sustained rates and peak rates are all user-configurable.

Finally, the results obtained from the FMan sub-modules described above are used to select the queue to which a frame is to be enqueued. These queues may be associated with different modules within the SoC (a CPU core, the Security Engine, etc.), which dequeue the frames and perform further processing.

For example, the FMan MAC receives a frame and forwards it to the FMan Parser. Parsing reveals that the frame is encrypted. The encryption keys are then attached to the frame and it is sent to the Security Engine, where it is decrypted without using any of the CPU cores. Next, the decrypted frame is given back to the FMan for further classification and policing, and is then either sent to a CPU core for software processing or transmitted elsewhere through the network.

Each FMan sub-module is highly configurable. Many of its functions can be tuned at run time, which makes the FMan a powerful hardware frame processor.

3 Software

Having first-class hardware is worth nothing without high-quality software. This chapter presents the process of porting FreeBSD to the QorIQ DPAA devices and the challenges of integrating the DPAA with the OS's TCP/IP network stack. The whole process, step by step, starting from the toolchain adjustments and ending with a fully functional FreeBSD system utilising the facilities of the QorIQ DPAA platform, is described in the following subsections.

3.1 Toolchain

The introduction of the new e500mc core implied a necessity for toolchain adaptation. The existing FreeBSD PowerPC code had already supported the e500mc's predecessor, so only the addition of a few new instructions was required. To fulfil this, appropriate changes were applied to both binutils and gcc. Fortunately, the e500mc patch for binutils had already been made available by the community, and a gcc patch was delivered to us by the SoC supplier, Freescale.

3.2 Early kernel initialisation in locore.S

locore.S contains assembly, architecture-dependent code which is executed at the very beginning of the FreeBSD start-up. The main goal of this code is the preparation of the environment for C code execution.

On the PowerPC architecture, locore.S is responsible for TLB and kernel stack initialisation. This topic is well described for the e500 core in [E500BSD], using the example of the MPC8572 SoC bring-up. This subsection focuses only on the changes introduced by the new e500mc core.

The first thing that distinguishes the e500mc from its predecessors is a bigger Translation Look-aside Buffer (TLB). This implied slight changes to the bit-field lengths in the registers controlling the TLB. Because the locore.S code uses these registers in multiple places, this small hardware change enforced a thorough line-by-line analysis and many adjustments in the assembly code.

Besides the bigger TLB, the new hypervisor privilege level present in the e500mc core introduced new MMU assist registers (MAS). These registers have to be set accordingly during each TLB access. Simply adding direct references to the new registers would break backward compatibility, as accesses to non-existent registers cause an exception and eventually a system hang-up (exceptions cannot be handled yet at this stage). To cope with this, special procedures were added (zero_mas7 and zero_mas8). These routines perform run-time core identification and write to the MAS registers (MAS7 and MAS8 respectively) only if the running processor actually has them. Analogous tweaks were necessary to correctly handle accesses to the Hardware Implementation-Dependent Registers (HIDs).
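The real routines live in locore.S and are written in assembly; the following C rendition sketches the idea. The PVR value is illustrative, so the core reference manuals should be consulted for the authoritative numbers.

#include <machine/cpufunc.h>	/* mfpvr() */
#include <machine/spr.h>	/* mtspr(), SPR_MAS7 */

#define	PVR_VER(pvr)	((pvr) >> 16)
#define	PVR_E500V1	0x8020	/* illustrative: e500v1 lacks MAS7 */

/*
 * C rendition of the zero_mas7() idea: clear MAS7, but only on cores
 * that actually implement the register, identified at run time from
 * the Processor Version Register.
 */
static void
zero_mas7(void)
{
	if (PVR_VER(mfpvr()) != PVR_E500V1)
		mtspr(SPR_MAS7, 0);
}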

The machine-dependent portion of the virtual memory subsystem (pmap(9)) also had to be slightly changed. However, as most of the pmap(9) modifications are related to Symmetric Multi-Processing, they are described in the appropriate section: 3.4.

3.3 QorIQ Data Path Acceleration Architecture

As OpenPIC and UART drivers already existed in FreeBSD, the next major task was the networking bring-up. Because of the SoC architecture, this involved the creation of device drivers for a variety of DPAA components, including the most complicated one: the Frame Manager.

During the research preceding driver development, it was found that the effort could be greatly reduced by using drivers from the Freescale NetCommSw software pack. NetCommSw is a packet processing framework able to run on bare-metal hardware. It consists of low-level OS-agnostic device drivers, protocol stacks and development tools, everything targeted at rapid development. Unfortunately, this software pack is proprietarily licensed.

Figure 3: NetCommSw Driver model.

However, it was found that one part of NetCommSw, the Frame Manager driver, was already available under the BSD license elsewhere. Utilising a ready-to-use driver for the most complex DPAA hardware component greatly reduced the development time. Moreover, Freescale, the owner of the NetCommSw software, agreed to release additional parts of this suite under the BSD license.

In the end, all the NetCommSw Low-Level Drivers necessary for this work (Buffer Manager, Queue Manager and Frame Manager) were available under the BSD license and ready to be integrated into the FreeBSD Project.

3.3.1 NetCommSw drivers integration

All NetCommSw drivers share the same programming model, shown in Figure 3, which is optimised for easy integration with various operating systems. At the centre of the model lie the Low-level Device Drivers. These drivers export a high-level, device-specific, object-oriented API abstracting the device functionalities. Only the middle part of this model, the Low-level Device Drivers, had been provided by Freescale. Both the Wrapper Drivers and the XX Routines had to be implemented.

All accesses to the hardware are performed through several operating system hooks called XX Routines. These are also used to obtain OS resources like locks and memory, as well as to register interrupt handlers. The XX layer makes the drivers fully OS-agnostic.

A Wrapper Driver, placed on top of the Low-level Device Driver, is also required. It is responsible for the attachment to the operating system and the translation of OS calls to the device-specific Low-level Device Driver API. At this level all accesses to the hardware have to be serialised, which implies the use of locks.

The XX Routines interface is quite straightforward and well defined. Mapping it to the FreeBSD kernel's internals was not a hard task (what should the body of the XX_Malloc() function be?). Nevertheless, not all functionality required by this layer was already available in the FreeBSD kernel. The clearest example is physical-to-virtual address translation.

Such translation is not used, and thus not supported, in the FreeBSD kernel at all. Moreover, PA to VA mapping is ambiguous, as multiple virtual pages might be linked with one physical page. To deal with this problem, a list of all active mappings referencing a given physical page was added to the machine-dependent part of the vm_page structure. Each time a VA to PA mapping is created, the pmap_enter() function adds a new entry to the list. Likewise, the entry is removed when the given mapping is destroyed in pmap_remove().

The XX Routines use a given physical address to get the appropriate vm_page structure and then use it to obtain the list of active translations. In the case of multiple mappings, only the first entry found (the last one added) is used to extract the virtual address.
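A sketch of this lookup is shown below. It assumes the per-page list of active mappings described above is kept in the machine-dependent part of vm_page; the field and type names follow the FreeBSD Book-E pmap conventions but are illustrative here.

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Translate a physical address to a virtual address using the
 * per-page list of active mappings. If several mappings exist, the
 * first (most recently added) entry is used.
 */
static vm_offset_t
xx_phys_to_virt(vm_paddr_t pa)
{
	vm_page_t m;
	pv_entry_t pv;

	m = PHYS_TO_VM_PAGE(trunc_page(pa));
	if (m == NULL)
		return (0);			/* not a managed page */
	pv = TAILQ_FIRST(&m->md.pv_list);	/* head = last added */
	if (pv == NULL)
		return (0);			/* no active mappings */
	return (pv->pv_va + (pa & PAGE_MASK));
}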

The other class of problems found during the XX Routines implementation was related to Symmetric Multi-Processing. These are described in section 3.4.1 of this article.

Writing the Wrapper Drivers, which are just newbus-compliant attachments, was almost effortless. However, as in the case of the XX Routines, the SMP implementation required additional work, also described in section 3.4.1.

3.3.2 Frame Manager driver

The integration of the Frame Manager driver required special care. Unlike the Buffer Manager and the Queue Manager, the FMan consists of several subsystems (described in section 2.3.2), each represented by a separate API in the NetCommSw Frame Manager Low-level Driver. Because of that, no single Wrapper Driver was created; instead, a sort of software partitioning was applied.

First, the common part used by all sub-modules was implemented as a single driver, following the FreeBSD newbus approach. It is responsible for the early Frame Manager initialisation and also manages the internal FMan resources used by the sub-module-specific drivers.

Every driver utilising any of the Frame Manager capabilities has to acquire resources from the FMan driver. One example is the Datapath Triple-Speed Ethernet Controller (dTSEC) driver, which implements the entire network functionality.

3.3.3 Datapath Triple-Speed Ethernet Controller driver

From the perspective of the operating system, the dTSEC is a classical network device driver. It implements the set of special networking interfaces defined by the FreeBSD kernel, extensively utilising the following DPAA components: Buffer Manager, Queue Manager and Frame Manager.


Figure 4: dTSEC packet transmission data flow.

Figure 5: dTSEC packet reception data flow.

The first thing the dTSEC driver does is configure the FMan sub-modules: BMI, QMI, FPM and MAC. These constitute the minimal set of components that have to be initialised in order to send and receive Ethernet packets. All these units are fully abstracted by two NetCommSw Frame Manager Low-Level Driver interfaces: FMan MAC and FMan Port.

The FMan MAC interface is used to communicate with the MAC sub-module only, whereas the FMan Port interface abstracts accesses to the FPM, BMI and QMI. It is, hence, the only way to control the data flow from and to the Frame Manager. Moreover, each instance of the FMan Port interface is able to handle either the transmission or the reception flow and can be associated with exactly one MAC. As a consequence, each Ethernet network interface driver has to utilise at least two FMan Port instances and one FMan MAC instance.

But this is not enough to exchange network packets with the MAC. Since the Frame Manager uses the BMI and QMI for all accesses to system memory, the Buffer Manager and Queue Manager also have to be utilised by the dTSEC driver. For example, received frame data is written to memory buffers provided by the Buffer Manager and simultaneously represented as a new frame enqueued to the Queue Manager.

On the transmission path, the dTSEC driver creates two QMan Frame Queues and attaches them to the FMan TX Port. The first queue, called the Transmission Queue, is used to feed the Frame Manager with the data to be transmitted. The driver only enqueues frames into the queue; nothing else is necessary. Transmitted frames are then returned through the second queue, called the Transmission Confirmation Queue. After checking the status code for errors, the driver frees the memory associated with the transmitted frames.
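The transmit flow can be sketched as follows. All structure and function names are illustrative stand-ins rather than the actual driver API, and a single contiguous mbuf is assumed for brevity.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <vm/vm.h>
#include <vm/pmap.h>		/* vtophys() */

/* Illustrative frame descriptor and driver state; not the real API. */
struct qm_frame_desc {
	uint64_t  fd_addr;	/* physical address of the frame data */
	uint32_t  fd_len;	/* frame length in bytes */
	uint32_t  fd_status;	/* status code, checked on confirmation */
	uintptr_t fd_cookie;	/* driver context: the mbuf pointer */
};

struct dtsec_softc {
	void	*sc_tx_queue;	/* Transmission Queue handle */
	long	 sc_oerrors;	/* transmit error counter */
};

int qman_enqueue(void *fq, struct qm_frame_desc *fd); /* hypothetical */

/* Feed the Frame Manager: enqueue one frame; nothing else to do. */
static int
dtsec_xmit(struct dtsec_softc *sc, struct mbuf *m)
{
	struct qm_frame_desc fd;

	fd.fd_addr = vtophys(mtod(m, vm_offset_t));
	fd.fd_len = m->m_pkthdr.len;
	fd.fd_status = 0;
	fd.fd_cookie = (uintptr_t)m;
	return (qman_enqueue(sc->sc_tx_queue, &fd));
}

/* Frame returned on the Transmission Confirmation Queue. */
static void
dtsec_tx_confirm(struct dtsec_softc *sc, const struct qm_frame_desc *fd)
{
	struct mbuf *m = (struct mbuf *)fd->fd_cookie;

	if (fd->fd_status != 0)
		sc->sc_oerrors++;	/* transmission failed */
	m_freem(m);			/* hardware is done with the data */
}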

On the reception path, the dTSEC driver creates only one QMan Frame Queue and one Buffer Pool managed by the Buffer Manager. Both the Frame Queue and the Buffer Pool are then associated with the FMan RX Port. After the FMan MAC receives a packet, a free buffer is allocated from the given Buffer Pool to hold the received data. Then, a frame referencing the buffer is created and enqueued to the selected Frame Queue. The driver's job is only to fetch frames from the Frame Queue associated with the FMan RX Port.

The quantity of free buffers in the Buffer Pool is maintained by the Buffer Manager itself. When the number of buffers in the pool falls below the defined depletion threshold, an interrupt is asserted. As a result, the BMan driver notifies the dTSEC driver, which uses the uma(9) zone allocator to provide fresh memory buffers. Buffers used by received packets are released back to the Buffer Manager pool, unless the pool already holds its predefined number of buffers, in which case the recycled buffers are freed back to uma(9).
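The depletion handling can be sketched like this; the pool handle and BMan helpers are hypothetical stand-ins, while uma(9) itself is the real FreeBSD allocator.

#include <sys/param.h>
#include <sys/malloc.h>		/* M_NOWAIT */
#include <vm/uma.h>

/* Hypothetical handles standing in for the real driver state. */
struct dtsec_pool_softc {
	uma_zone_t	 sc_rx_zone;	/* uma(9) zone feeding the pool */
	void		*sc_pool;	/* BMan Buffer Pool handle */
};
int  bman_pool_count(void *pool);		/* hypothetical */
void bman_pool_release(void *pool, void *buf);	/* hypothetical */

#define	DTSEC_POOL_TARGET	256	/* illustrative fill level */

/* Invoked when the BMan depletion-threshold interrupt fires. */
static void
dtsec_pool_replenish(struct dtsec_pool_softc *sc)
{
	void *buf;

	while (bman_pool_count(sc->sc_pool) < DTSEC_POOL_TARGET) {
		buf = uma_zalloc(sc->sc_rx_zone, M_NOWAIT);
		if (buf == NULL)
			break;	/* retry on the next depletion interrupt */
		bman_pool_release(sc->sc_pool, buf);
	}
}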

The data flow described above is illustrated in Figure 4 and Figure 5.

As all data transfers between the dTSEC driver and the hardware are performed through the specialised BMan and QMan portals, briefly described in section 2.3.1, the software overhead is negligible. Moreover, the performance is further improved by the interrupt throttling embedded in the Queue Manager.

3.3.4 DPAA in newbus hierarchy

As mentioned earlier, all DPAA components were implemented as newbus-compliant drivers. Their layout in the driver hierarchy reflects the SoC memory map. The drivers are attached and initialised by applying a depth-first tree search algorithm to this structure, which conflicts with the DPAA hardware initialisation requirements.

Figure 6: DPAA in newbus hierarchy.

To cope with this situation, a virtual bus named DPAA was introduced. The drivers utilising more than one of the DPAA components (BMan, QMan and FMan) had to be placed under this bus (see Figure 6 for details), which ensures the proper initialisation order.

3.4 SMP bring up

The QorIQ Data Path Acceleration Architecture SoCs are equipped with up to eight e500mc cores. The possibility of utilising their power using Symmetric Multi-Processing gives the system designer an additional advantage.

FreeBSD had already supported SMP on the e500mc core's predecessors (see [E500BSD]). As the differences introduced by the e500mc core are not directly related to multi-processor support, it seemed that enabling SMP support would not require extra effort.

However, the existing model of Inter-Processor Interrupt (IPI) handling had not taken hardware limitations into account. It worked well with up to two cores, but adding another one disturbed the inter-core communication: when an IPI was sent to multiple cores, only one of them actually received the message.

Investigation showed that IPI messages were sent one by one through a single communication channel, without checking its availability. As the hardware treats the communication channel as busy until the target core acknowledges message reception, only the first message in the burst was actually delivered.

Figure 7: DPAA Portal mappings.

To cope with this problem, IPI multi-casting was implemented. Instead of sending IPI messages to multiple cores sequentially, one multi-cast message is used. This change allowed for the successful bring-up of quad-core SMP.
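On OpenPIC-style controllers the IPI dispatch register takes a destination CPU mask, so a multi-cast is a single register write. The sketch below simplifies the real register layout; the register offset is hypothetical.

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/rman.h>
#include <machine/bus.h>

/* Hypothetical offset of the IPI dispatch register for channel 0. */
#define	OPENPIC_IPI0_DISPATCH	0x40

/*
 * Deliver an IPI to every core set in cpu_mask with a single write.
 * The dispatch register takes a destination mask, so one multi-cast
 * message replaces a sequence of unicasts over a busy channel.
 */
static void
openpic_ipi_multicast(struct resource *mem, uint32_t cpu_mask)
{
	bus_write_4(mem, OPENPIC_IPI0_DISPATCH, cpu_mask);
}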

Nevertheless, the integration of SMP with the Data Path Acceleration Architecture enforced further deep changes in both FreeBSD and the NetCommSw drivers.

3.4.1 DPAA in SMP environment

As described in section 2.3.1, accesses to the Buffer and Queue Managers are performed through special registers called portals. As long as the accesses from different CPU cores are performed through separate portals, there is no need for serialisation. This approach eliminates the bottleneck caused by lock congestion in an SMP environment.

In order to utilise this advantage, it was decided to assign one dedicated portal to each CPU core. The core ↔ portal mapping is transparent to the CPU: each portal occupies the same virtual address on each core (see Figure 7 for details).

However, the existing FreeBSD SMP support for the e500 family allowed only one device mapping, shared by all cores. Originally, all device mappings were simply copied from the bootstrap CPU to the application CPUs. This approach has been partially reused: now, only mappings marked as shared are copied. The mark is located in special user-defined bits available in the TLB entries of the whole e500 family. This simple change required significant modification of the pmap(9) code responsible for device mappings, as well as of the CPU start-up procedures.

The new approach also implied changes in the NetCommSw Wrapper Drivers. When the FreeBSD drivers are attached, only the bootstrap CPU is alive; thus, mappings for the application CPUs cannot be set at this stage. Besides that, the driver also has to configure each mapped portal.

In consequence, after the FreeBSD start-up, only the one portal attached to the bootstrap CPU is accessible. But, as all cores are running at this stage, accesses to the drivers might be initiated from any of them. Hence, the Wrapper Drivers are responsible for mapping and configuring portals during the first access from any of the application CPUs.

However, run-time portal creation caused a new problem to appear. From the NetCommSw Wrapper Driver perspective, portal configuration is reduced to just a few calls to the NetCommSw Low-Level Driver. The Low-Level Driver then uses the XX Routines to obtain OS and hardware resources. One of the types of allocated resources is an interrupt. In FreeBSD, requesting an interrupt might sleep in intr_event_create(), hidden deep inside the operating system. As a portal might be created at any time, the sleep during an interrupt request introduces several locking issues.

For example, the FreeBSD TCP/IP stack might send a packet to a network driver while holding a lock related to a network protocol. If sending the packet requires portal configuration on any of the application processors, then a sleep with a lock held may occur, which is prohibited.

To deal with this problem, an additional layer for interrupt management has been introduced in the XX Routines. Interrupt registration calls from the NetCommSw Wrapper Drivers are not directly translated to their FreeBSD counterparts; instead, interrupts are obtained from the new layer, which preallocates them during system start-up.
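A minimal sketch of such a layer follows. The pool size and names are illustrative; bus_setup_intr(9), which the boot-time preallocation would use, is the real FreeBSD primitive being wrapped.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

#define	XX_NINTR	16	/* illustrative pool size */

struct xx_intr {
	void	*xi_cookie;	/* from bus_setup_intr() at boot time */
	int	 xi_used;
};

static struct xx_intr	xx_intr_pool[XX_NINTR];
/* Spin mutex (mtx_init(..., MTX_SPIN) at boot): never sleeps. */
static struct mtx	xx_intr_mtx;

/*
 * Hand out an interrupt preallocated during system start-up. Unlike
 * a direct bus_setup_intr(9) call, this never sleeps, so it is safe
 * to use while configuring a portal with network locks held.
 */
static void *
xx_intr_alloc(void)
{
	void *cookie = NULL;
	int i;

	mtx_lock_spin(&xx_intr_mtx);
	for (i = 0; i < XX_NINTR; i++) {
		if (!xx_intr_pool[i].xi_used) {
			xx_intr_pool[i].xi_used = 1;
			cookie = xx_intr_pool[i].xi_cookie;
			break;
		}
	}
	mtx_unlock_spin(&xx_intr_mtx);
	return (cookie);
}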

Nevertheless, interrupt allocation is not the only problem here. An interrupt asserted by any given portal might be scheduled for execution on any of the processor cores. As each core has access to only a single dedicated portal, an incorrectly scheduled interrupt routine would service a different portal than the one assigned to it. Thus, the additional interrupt management layer in the XX Routines has also been made responsible for binding the interrupt threads to the proper CPU cores.

4 Other peripherals bring up

Besides the e500mc core, SMP and DPAA, support for other peripherals like PCI Express, the USB EHCI controller, the I2C controller and the SoC's internal DMA was added. However, these devices were already well supported in the FreeBSD kernel, and the bring-up effort was reduced merely to the proper assignment of system resources to the drivers.

5 Current state and results

Currently, FreeBSD fully supports three members of the QorIQ Data Path Acceleration Architecture family: P2041 (quad core, 1.2 GHz), P3041 (quad core, 1.5 GHz) and P5020 (dual core, 2.0 GHz). Besides the DPAA subsystem, there are ready-to-use drivers for USB 2.0 EHCI, the SD/MMC controller and other peripherals like the UARTs, I2C and DMA controllers. The built-in PCI Express is also supported.

Network tests have shown that the DPAA architecture significantly reduces the CPU utilisation related to packet processing. On the P3041 SoC, the iperf tool (# iperf -c <iperf-server> -l 16M -w 128k -t 60) gave an 897 Mbit/s transfer rate through a single Ethernet interface. The CPU utilisation during the test was below 30%, of which 90% was caused by interrupt servicing and could be reduced by enabling polling mode, which has not been implemented yet.

The following listing, showing the FreeBSD interrupt counters, lays out an almost equal distribution of packets between the different Queue Manager portals (interrupts 120-126), and thus between the CPU cores.

p3041# vmstat -i
interrupt              total    rate
irq121: bman0              1       0
irq120: qman0        2784930    3584
irq122: qman0        2194130    2823
irq124: qman0        2263079    2912
irq126: qman0        2148167    2764
(...)

Also, the small number of interrupts asserted by the Buffer Manager shows that memory buffer management is performed entirely by the hardware, and software intervention is not necessary (the one visible interrupt is asserted during the initial Buffer Pool creation).


6 Future work

Although the current state of the FreeBSD QorIQ DPAA port is stable and runs on the majority of the QorIQ DPAA family SoCs, support for some integrated peripherals is still missing. This includes the built-in SATA controller and a few members of the DPAA subsystem, like the Pattern Matching Engine (PME) and the Security Engine.

Also, not all features of the DPAA that could be used by the FreeBSD TCP/IP stack are available. Polling mode, hardware packet checksumming and Jumbo Frames are not yet supported.

Worth trying would be the integration of advanced Frame Manager capabilities, like fully hardware-based IPSec processing, with the FreeBSD TCP/IP stack.

7 Conclusions

FreeBSD can be successfully used on the QorIQ Data Path Acceleration Architecture SoCs, and it is able to utilise the majority of the features available in the hardware. Although the full hardware packet processing provided by those SoCs is not supported, as it would require deep changes in the FreeBSD TCP/IP stack, the overall network performance achieved in this work is promising.

8 Summary

In this paper, the FreeBSD port for QorIQ Data Path Acceleration Architecture SoCs was presented.

First, an overview of the QorIQ DPAA SoC hardware was given, including the new e500mc core and the CoreNet Interconnect. Special attention was paid to the DPAA components: the Buffer Manager, Queue Manager and Frame Manager.

Next, the process of porting FreeBSD to the QorIQ DPAA devices was presented, along with the challenges encountered during the DPAA integration with the OS's TCP/IP network stack. This included the toolchain adaptation, early kernel initialisation (the locore.S part), the implementation of the DPAA component drivers and the Symmetric Multi-Processing bring-up.

Finally, the result of the porting process was given: a fully functional FreeBSD system running on the P3041 SoC and utilising the facilities of the Data Path Acceleration Architecture to achieve a considerable network performance boost.

9 Acknowledgements

Special thanks go to the following people:

Rafał Jaworowski (Semihalf, The FreeBSD Project), mentor of this project.

Phil Brownfield (Freescale), for help andsupport with relicensing the NetCommSw codeand device tree source files.

Zbigniew Bodek, Piotr Nowak, Tomasz Nowicki, Jan Sięka, Łukasz Wójcik (all Semihalf), for all the work on this project.

Work on this paper was sponsored by Semihalf.

10 Availability

The code described in this paper is (or will soon be) available from the FreeBSD Project Subversion repository, CURRENT branch.

References

[E500MC] Freescale Semiconductor, Inc., e500mc Core Reference Manual, Rev. 0, 09/2011

[DPAARM] Freescale Semiconductor, Inc., QorIQ Data Path Acceleration Architecture (DPAA) Reference Manual, Rev. 2, 11/2011

[P2040RM] Freescale Semiconductor, Inc., P2040 QorIQ Integrated Multicore Communication Processor Family Reference Manual, Rev. 0, 11/2011

[P3041FS] Freescale Semiconductor, Inc., P3041 Fact Sheet, Rev. 3, 08/2011

[P3041RM] Freescale Semiconductor, Inc., P3041 QorIQ Integrated Multicore Communication Processor Family Reference Manual, Rev. 0, 11/2011

[P5020RM] Freescale Semiconductor, Inc., P5020 QorIQ Integrated Multicore Communication Processor Family Reference Manual, Rev. 0, 11/2011

[RFC2698] J. Heinanen (Telia Finland), R. Guerin, A Two Rate Three Color Marker, RFC 2698, 1999

[E500BSD] Rafał Jaworowski, FreeBSD on high performance multi-core embedded PowerPC systems, 2009
