Achieving High Performance with TCP over 40GbE on NUMA Architectures for CMS Data Acquisition. Andrea Petrucci – CERN (PH/CMD). The 19th Real Time Conference, 27 May 2014, Nara, Japan


Page 1: RT2014_TCPLA_Nara_27052014-V1

Achieving High Performance with TCP over 40GbE on NUMA Architectures for CMS Data Acquisition

Andrea Petrucci – CERN (PH/CMD)

The 19th Real Time Conference, 27 May 2014, Nara, Japan

Page 2: RT2014_TCPLA_Nara_27052014-V1

The 19th Real Time Conference, 27 May 2014, Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci, CERN (PH/CMD)

Outline

New Processor Architectures

DAQ at the CMS Experiment

CMS Online Software Framework (XDAQ) Architecture Foundation Memory and Thread Managements Data Transmission

TCPLA - TCP Layered Architecture Motivation Architecture

Performance Tuning

Preliminary Results

Summary

Page 3: RT2014_TCPLA_Nara_27052014-V1


New Processor Architectures

In the mid-2000s, processor clock frequencies plateaued and the number of cores per processor started to increase

The “golden” era for software developers ended in 2004, when Intel cancelled its high-performance uniprocessor projects: “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” (Herb Sutter, 2005)

Software designed with concurrency in mind resulted in more efficient use of new processors

Processor evolution (source: Sam Naffziger, AMD)

Page 4: RT2014_TCPLA_Nara_27052014-V1


Non-Uniform Memory Access (NUMA)

The distance between processing cores and memory or I/O interrupts varies within Non-Uniform Memory Access (NUMA) architectures

[Diagram: four CPUs, each with its own memory controller and attached I/O devices; core-to-memory and core-to-I/O distances vary across the system]

Page 5: RT2014_TCPLA_Nara_27052014-V1

CMS Experiment

Page 6: RT2014_TCPLA_Nara_27052014-V1


CMS DAQ Requirements for LHC Run 2

Parameters

Data Sources (FEDs): ~620
Trigger levels: 2
First Level rate: 100 kHz
Event size: 1 to 2 MB
Readout Throughput: 200 GB/s
High Level Trigger rate: 1 kHz
Storage Bandwidth: 2 GB/s

Page 7: RT2014_TCPLA_Nara_27052014-V1


CMS DAQ System for LHC Run 2

[Diagram: Readout Unit with two CPU sockets and I/O, 40 Gb/s Ethernet input, 56 Gb/s Infiniband FDR output]

Readout Unit

Page 8: RT2014_TCPLA_Nara_27052014-V1

CMS Online Software Framework (XDAQ)

Page 9: RT2014_TCPLA_Nara_27052014-V1


XDAQ Framework

XDAQ is a software platform created specifically for the development of distributed data acquisition systems

Implemented in C++, developed by the CMS DAQ group

Provides platform independent services, tools for inter-process communication, configuration and control

Builds upon industrial standards, open protocols and libraries, and is designed according to the object-oriented model

For further information about XDAQ see:

J. Gutleber, S. Murray and L. Orsini, “Towards a homogeneous architecture for high-energy physics data acquisition systems”, Computer Physics Communications, vol. 153, issue 2, pp. 155-163, 2003. http://www.sciencedirect.com/science/article/pii/S0010465503001619

Page 10: RT2014_TCPLA_Nara_27052014-V1


XDAQ Architecture Foundation

Core Executive

Plugin Interface

Page 11: RT2014_TCPLA_Nara_27052014-V1


Application Plugin

Core Executive

Plugin Interface

XDAQ Architecture Foundation

XML Configuration

Page 12: RT2014_TCPLA_Nara_27052014-V1


XDAQ Architecture Foundation

XML Configuration

SOAP

HTTP

RESTful

Control and Interface

Application Plugin

Core Executive

Plugin Interface

Page 13: RT2014_TCPLA_Nara_27052014-V1


XDAQ Architecture Foundation

XML Configuration

SOAP

HTTP

RESTful

Control and Interface

Application Plugin

Core Executive

Plugin Interface

Hardware Access

VME, PCI, Memory

Page 14: RT2014_TCPLA_Nara_27052014-V1


XDAQ Architecture Foundation

XML Configuration

SOAP

HTTP

RESTful

Control and Interface

Application Plugin

Core Executive

Plugin Interface

Hardware Access

VME, PCI, Memory

Protocols and Formats

Uniform building blocks - One or more executives per computer contain application and service components

Page 15: RT2014_TCPLA_Nara_27052014-V1



Memory – Memory Pools

Memory Pool

Cached buffers

All memory allocation can be bound to a specific NUMA node

Memory pools allocate and cache memory using log2 best-fit allocation

No fragmentation of memory over long runs

Buffers are recycled during runs: constant time for retrieval
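The pool behaviour above can be sketched in a few lines; this is a minimal illustration of log2 best-fit allocation with buffer recycling, not the actual XDAQ allocator (the class and method names are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Minimal sketch of a log2 best-fit, buffer-recycling pool.
class Log2Pool {
public:
    // Round the request up to the next power of two and hand out a
    // cached buffer if one is free, otherwise allocate a new one.
    std::vector<char>* allocate(std::size_t size) {
        std::size_t bucket = roundUpPow2(size);
        auto& freeList = free_[bucket];
        if (!freeList.empty()) {
            std::vector<char>* buf = freeList.back();
            freeList.pop_back();
            return buf;                        // recycled: constant time
        }
        return new std::vector<char>(bucket);  // first use: real allocation
    }

    // Return the buffer to the pool instead of freeing it, so long
    // runs never fragment the heap.
    void release(std::vector<char>* buf) {
        free_[buf->size()].push_back(buf);
    }

    static std::size_t roundUpPow2(std::size_t n) {
        std::size_t p = 1;
        while (p < n) p <<= 1;
        return p;
    }

private:
    std::map<std::size_t, std::vector<std::vector<char>*>> free_;
};
```

A 1500-byte request is served from the 2048-byte bucket, so a later 2000-byte request can reuse the exact same buffer.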

Page 16: RT2014_TCPLA_Nara_27052014-V1


Memory – Buffer Loaning

Buffer loaning allows zero-copy transfer of data between software layers and processes:

Step 1: Task A allocates a buffer
Step 2: Task A loans the reference to Task B
Step 3: Task B releases the reference
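The three steps can be modeled with reference counting; a minimal sketch, in which a `shared_ptr` deleter stands in for returning the buffer to its pool (names are illustrative, not the XDAQ API):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

using Buffer = std::vector<char>;

int returned = 0;  // counts buffers handed back to the "pool"

// Task A allocates: the custom deleter models returning the buffer
// to its pool when the last loaned reference is released.
std::shared_ptr<Buffer> allocate(std::size_t size) {
    return std::shared_ptr<Buffer>(new Buffer(size),
                                   [](Buffer* b) { ++returned; delete b; });
}

// Task B receives a loaned reference: it works on the same underlying
// memory, so the payload is never copied.
void taskB(std::shared_ptr<Buffer> loaned) {
    loaned->front() = 1;
}
```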

Page 17: RT2014_TCPLA_Nara_27052014-V1



Multithreading

Workloops provide easy use of threads: a workloop thread executes work assigned to it by the application thread

Workloops can be bound to run on specific CPU cores by configuration
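Pinning a workloop thread to a core can be done on Linux with `pthread_setaffinity_np`; a sketch with an illustrative helper name (the XDAQ configuration mechanism itself is not shown):

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>
#include <sched.h>
#include <thread>

// Bind an already-running thread to one specific core (Linux-only).
bool bindToCore(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);  // allow exactly one core in the mask
    return pthread_setaffinity_np(t.native_handle(),
                                  sizeof(cpu_set_t), &set) == 0;
}
```

After binding, the scheduler may only run the thread on the chosen core, which keeps its working set in that core's cache and NUMA node.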

Page 18: RT2014_TCPLA_Nara_27052014-V1


Data Transmission – Peer Transports

[Diagram: two XDAQ hosts, each running a User Application with a Peer Transport over the NIC; logical communication between the applications at the protocol level]

The application uses the network transparently through the XDAQ framework

The user application is network and protocol independent

Routing is defined by the XDAQ configuration

Connections are set up through a peer-to-peer model

Page 19: RT2014_TCPLA_Nara_27052014-V1

TCP Layered Architecture

Page 20: RT2014_TCPLA_Nara_27052014-V1


Motivation

Lessons learned from using previous peer transports for TCP in a real environment led to a more advanced design that copes with:

Separation of TCP socket handling from XDAQ protocols

Exploitation of a multi-threading environment

Optimisation through configurability to improve performance, e.g. grouping sockets and associating different threads to different roles

Page 21: RT2014_TCPLA_Nara_27052014-V1


TCPLA Inspired by uDAPL (Direct Access Programming Library)

Developed by DAT collaborative http://www.datcollaborative.org/

Defines user Direct Access Programming Library (uDAPL) and kernel Direct Access Programming Library (kDAPL) APIs

TCPLA is inspired by uDAPL specification, in particular the send/receive semantics it describes

Page 22: RT2014_TCPLA_Nara_27052014-V1


What is the TCP Layered Architecture?

TCP Layered Architecture (TCPLA) is a lightweight, transport and platform independent user-level library for handling socket processing

Provides the user with an event driven model of handling network communications

All calls to send and receive data are performed asynchronously

Two peer transports have been developed:

ptFRL is specific to reading TCP/IP streams from FEROLs in the RU machine (FEROL protocol)

ptUTCP is for general-purpose data flow (general protocols)

Page 23: RT2014_TCPLA_Nara_27052014-V1


TCPLA Architecture Principles (I)

Send Queue: filled buffers for sending

Receive Queue: pre-allocated buffers for receiving

Completion Event Queue: received buffer, connection request, sent buffer, connection established

The Peer Transport consumes completion events through an event handler registered with the TCP Layered Architecture
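A minimal model of these queues, with hypothetical type names (not the real TCPLA API); in practice a worker thread would drain the send queue onto a socket, which is simulated synchronously here:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Completion event types mirroring the slide's event queue.
enum class EventType { BufferReceived, BufferSent,
                       ConnectionRequest, ConnectionEstablished };

struct CompletionEvent { EventType type; std::vector<char>* buffer; };

struct EndPoint {
    std::queue<std::vector<char>*> sendQueue;     // filled buffers to send
    std::queue<std::vector<char>*> receiveQueue;  // pre-allocated buffers
    std::queue<CompletionEvent>    eventQueue;    // completions to consume

    // Posting is asynchronous: it only enqueues the buffer and returns.
    void postSend(std::vector<char>* buf) { sendQueue.push(buf); }

    // A send workloop would write each buffer to the socket, then push
    // a BufferSent completion event for the peer transport to consume.
    void progress() {
        while (!sendQueue.empty()) {
            std::vector<char>* buf = sendQueue.front();
            sendQueue.pop();
            eventQueue.push({EventType::BufferSent, buf});
        }
    }
};
```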

Page 24: RT2014_TCPLA_Nara_27052014-V1


TCPLA Architecture Principles (II)

Receive Workloop, Send Workloop and Event Workloop each perform socket I/O

1..N sockets are associated per workloop
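The 1..N association can be sketched as a round-robin assignment of socket descriptors to workloops, so each workloop polls only its own subset; the function name is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assign socket file descriptors to workloops round-robin, so load
// is spread evenly and each workloop owns a fixed group of sockets.
std::vector<std::vector<int>> assignSockets(const std::vector<int>& fds,
                                            std::size_t nWorkloops) {
    std::vector<std::vector<int>> groups(nWorkloops);
    for (std::size_t i = 0; i < fds.size(); ++i)
        groups[i % nWorkloops].push_back(fds[i]);
    return groups;
}
```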

Page 25: RT2014_TCPLA_Nara_27052014-V1

Performance Tuning

Page 26: RT2014_TCPLA_Nara_27052014-V1


Performance Factors

CPU affinity

I/O Interrupt affinity

Memory affinity

TCP Custom Kernel Settings
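Kernel settings are applied system-wide, but they have per-socket counterparts that can be set in code. A sketch assuming common high-throughput choices (a larger receive buffer and Nagle's algorithm disabled) rather than the actual CMS settings:

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Apply illustrative per-socket TCP tuning to an open socket.
int tuneSocket(int fd) {
    // Request a 4 MB receive buffer (the kernel clamps this to
    // net.core.rmem_max, one of the sysctl settings tuned above).
    int rcvbuf = 4 * 1024 * 1024;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf, sizeof(rcvbuf)) != 0) return -1;
    // Disable Nagle so small event fragments are sent immediately.
    int nodelay = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                   &nodelay, sizeof(nodelay)) != 0) return -1;
    return 0;
}
```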

Page 27: RT2014_TCPLA_Nara_27052014-V1


Readout Unit Machine Affinities

Affinities

40 Gb/s Ethernet

56 Gb/s Infiniband

I/O I/O

Readout Unit

CPU Socket 1 / NUMA Node 1 (16 GB): cores 15, 13, 11, 9, 7, 5, 3, 1

CPU Socket 0 / NUMA Node 0 (16 GB): cores 14, 12, 10, 8, 6, 4, 2, 0

40 GbE - I/O Interrupts

40 GbE - Socket reading

Infiniband - I/O Interrupts

DAQ threads

DAQ Memory allocation

Cores available to operating system

Page 28: RT2014_TCPLA_Nara_27052014-V1

Preliminary Results

Page 29: RT2014_TCPLA_Nara_27052014-V1


P2P Example: Out of the Box vs Affinity

[Plot: out-of-the-box vs affinity-tuned throughput; link saturation is reached with affinity]

Page 30: RT2014_TCPLA_Nara_27052014-V1


FEROL Aggregation into One RU

Aggregation n-to-1, for example:

12 streams from 12 FEROLs, each sending fragment sizes from 2 to 4 kB

16 streams from 8 FEROLs, each sending fragment sizes from 1 to 2 kB

Concentrated into one 40 GbE NIC in the RU PC

Mellanox SX 1024 switch with 48 ports at 10 GbE and 12 ports at 40 GbE

Reliability and congestion handled by TCP/IP; pause frames are enabled in the 10/40 GbE switch

Page 31: RT2014_TCPLA_Nara_27052014-V1


DAQ Test Bed

The final performance indicator for TCPLA is the performance achieved while executing simultaneous input and output in the RU

Test bed with up to 47 FEROLs (10 GbE) as input

[Diagram: Readout Unit with two CPU sockets and I/O, 40 Gb/s Ethernet input, 56 Gb/s Infiniband FDR output]

Readout Unit

Page 32: RT2014_TCPLA_Nara_27052014-V1


Test with 1 Data Source from FEROL (I)

40 GbE

Page 33: RT2014_TCPLA_Nara_27052014-V1


Test with 1 Data Source from FEROL (II)

40 GbE

working range

90% efficiency of 40 GbE at 4.5 GB/s

Page 34: RT2014_TCPLA_Nara_27052014-V1

Test with 2 Data Sources from FEROL

Page 35: RT2014_TCPLA_Nara_27052014-V1


Test with 2 Data Sources from FEROL

40 GbE

working range

90% efficiency of 40 GbE at 4.5 GB/s

Page 36: RT2014_TCPLA_Nara_27052014-V1


Summary

TCPLA is based on the standard socket library with optimizations for NUMA environments using the XDAQ framework

It has been developed to allow high performance of critical applications in the new CMS DAQ system by exploiting multi-core architectures

It will be used for both data flow and monitoring

Page 37: RT2014_TCPLA_Nara_27052014-V1


Questions ?

TCPLA is a powerful tool, with many configurable parameters: Send Timeout, Receive Timeout, Affinity, Single-Threaded Dispatcher, Multi-Threaded Dispatcher, Poll, Select, Subnet Scanning, Port Scanning, TCP, SDP, Blocking, Non-Blocking, Connect On Request, Polling Cycle, Datagram, Auto Connect, FRL, I2O, B2IN, I/O Queue Size, Polling Work Loop, Waiting Work Loop, Event Queue Size, Receive Buffers, Receive Block Size, Max Clients, Max FD

Page 38: RT2014_TCPLA_Nara_27052014-V1

Backup Slides

Page 39: RT2014_TCPLA_Nara_27052014-V1


Software Architecture

[Class diagram: PeerTransport, EventHandler, InterfaceAdapter, EndPoint, PublicServicePoint, Dispatcher (Poll / Select)]

Page 40: RT2014_TCPLA_Nara_27052014-V1


Connecting

[Sequence diagram between InterfaceAdapter, EventQueue and EventHandler; legend: thread, mutex-less, asynchronous]

Connect → wait for establishment → push established event → consume event

Page 41: RT2014_TCPLA_Nara_27052014-V1


Sending

[Sequence diagram between InterfaceAdapter, EventQueue and EventHandler; legend: thread, mutex-less, asynchronous]

Buffer to send → push into send queue → check sending is possible → pop buffer from send queue → send buffer → push sent event → consume event

Page 42: RT2014_TCPLA_Nara_27052014-V1


Receiving

[Sequence diagram between Dispatcher, PublicServicePoint, EventQueue and EventHandler; legend: thread, mutex-less, asynchronous]

Check for incoming → incoming packet → get free buffer → socket receive → dispatch as reference → push receive event → consume event → provide free buffer

Page 43: RT2014_TCPLA_Nara_27052014-V1


Readout Unit Machine Affinities

Affinities

40 Gb/s Ethernet

56 Gb/s Infiniband

I/O I/O

Readout Unit

CPU Socket 1 / NUMA Node 1 (16 GB): cores 15, 13, 11, 9, 7, 5, 3, 1

CPU Socket 0 / NUMA Node 0 (16 GB): cores 14, 12, 10, 8, 6, 4, 2, 0

40 GbE - I/O Interrupts

40 GbE - Socket reading

Infiniband - I/O Interrupts

DAQ threads

DAQ Memory allocation

Cores available to operating system